4 changes: 2 additions & 2 deletions .github/workflows/documentation.yml
@@ -15,7 +15,7 @@ jobs:
        uses: actions/checkout@v3
        with:
          fetch-depth: 0 # otherwise, you will fail to push refs to dest repo
-          token: ${{ secrets.SPHINX_DOCUMENTATION}}
+          token: ${{ secrets.GITHUB_TOKEN}}

      - name: Set up Python
        uses: actions/setup-python@v4
@@ -41,5 +41,5 @@ jobs:
        if: ${{ github.event_name == 'push' }}
        uses: ad-m/github-push-action@v0.6.0
        with:
-          github_token: ${{ secrets.SPHINX_DOCUMENTATION }}
+          github_token: ${{ secrets.GITHUB_TOKEN }}
          branch: gh-pages
4 changes: 1 addition & 3 deletions .github/workflows/tests.yml
@@ -11,12 +11,11 @@ jobs:
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
-        python-version: ["3.9", "3.10"]
+        python-version: ["3.10", "3.11"]

    # This allows a subsequently queued workflow run to interrupt previous runs
    concurrency:
      group: "${{ github.workflow }} - ${{ matrix.python-version}} - ${{ matrix.os }} @ ${{ github.ref }}"
      cancel-in-progress: true

    steps:
      - uses: actions/checkout@v3
@@ -49,7 +48,6 @@ jobs:
          pip install ".[tutorials]"
          pip install -r ./docs/tutorials/requirements.txt

-
      - name: Convert and run notebooks
        shell: bash
        run: |
2 changes: 0 additions & 2 deletions README.md
@@ -8,7 +8,6 @@
[![github actions pytest](https://github.com/hlasse/textdescriptives/actions/workflows/tests.yml/badge.svg)](https://github.com/hlasse/textdescriptives/actions)
[![github actions docs](https://github.com/hlasse/textdescriptives/actions/workflows/documentation.yml/badge.svg)](https://hlasse.github.io/TextDescriptives/)
[![status](https://joss.theoj.org/papers/06447337ee61969b5a64de484199df24/status.svg)](https://joss.theoj.org/papers/06447337ee61969b5a64de484199df24)
-[![Open in Streamlit](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://huggingface.co/spaces/HLasse/textdescriptives)

A Python library for calculating a large variety of metrics from text(s) using spaCy v.3 pipeline components and extensions.

@@ -87,7 +86,6 @@ All the tutorials are located in the `docs/tutorials` folder and can also be fou
| Documentation | |
| -------------------------- | ---------------------------------------------------------------------------------- |
| 📚 **[Getting started]** | Guides and instructions on how to use TextDescriptives and its features. |
-| 👩‍💻 **[Demo]** | A live demo of TextDescriptives. |
| 😎 **[Tutorials]** | Detailed tutorials on how to make the most of TextDescriptives |
| 📰 **[News and changelog]** | New additions, changes and version history. |
| 🎛 **[API References]** | The detailed reference for TextDescriptives' API, including function documentation |
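The README's one-line description above summarizes the design: metrics are computed by spaCy pipeline components. As a quick illustration, a hedged usage sketch — the component name `textdescriptives/readability` and the `extract_df` helper follow the library's current documentation and should be treated as assumptions, not as verified by this PR:

```python
# Hedged usage sketch; exact component/attribute names are assumptions
# based on the textdescriptives documentation.
import spacy
import textdescriptives as td  # importing registers the pipeline components

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives/readability")

doc = nlp("The world is changed. I feel it in the water.")
print(doc._.readability)   # dict of readability metrics for the Doc
print(td.extract_df(doc))  # the same metrics as a pandas DataFrame
```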
17 changes: 9 additions & 8 deletions docs/coherence.rst
@@ -5,7 +5,7 @@ The `coherence` component calculates the coherence of the document, based on
word embedding cosine similarity between sentences.

textdescriptives currently implements first-order and second-order coherence. The
-implementation follows e.g. [1] and [2]:
+implementation follows e.g. `[1] <https://doi.org/10.1038/npjschz.2015.30>`_ and `[2] <https://doi.org/10.1016/j.schres.2022.07.002>`_:
* First-order coherence: The cosine similarity between consecutive sentences.
* Second-order coherence: The cosine similarity between sentences that are two sentences apart.

@@ -18,17 +18,13 @@ and `overwriting the vector attribute <https://spacy.io/usage/linguistic-feature

The following attributes are added to :code:`Doc` objects:

-* ._.first_order_coherence_values: A list of floats, where each float is the
+* :code:`._.first_order_coherence_values`: A list of floats, where each float is the
  cosine similarity between consecutive sentences.
-* ._.second_order_coherence_values: A list of floats, where each float is the
+* :code:`._.second_order_coherence_values`: A list of floats, where each float is the
  cosine similarity between sentences that are two sentences apart.
-* ._.cohererence: a dict containing the mean coherence values for first and
+* :code:`._.coherence`: a dict containing the mean coherence values for first and
  second order coherence (keys: "first_order_coherence", "second_order_coherence")

-[1] Bedi, G., Carrillo, F., Cecchi, G. A., Slezak, D. F., Sigman, M., Mota, N. B., Ribeiro, S., Javitt, D. C., Copelli, M., & Corcoran, C. M. (2015). Automated analysis of free speech predicts psychosis onset in high-risk youths. Npj Schizophrenia, 1(1), Article 1. https://doi.org/10.1038/npjschz.2015.30
-
-[2] Parola, A., Lin, J. M., Simonsen, A., Bliksted, V., Zhou, Y., Wang, H., Inoue, L., Koelkebeck, K., & Fusaroli, R. (2022). Speech disturbances in schizophrenia: Assessing cross-linguistic generalizability of NLP automated measures of coherence. Schizophrenia Research. https://doi.org/10.1016/j.schres.2022.07.002


Usage
~~~~~~
@@ -66,3 +62,8 @@ Component
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autofunction:: textdescriptives.components.coherence.create_coherence_component


+[1] Bedi, G., Carrillo, F., Cecchi, G. A., Slezak, D. F., Sigman, M., Mota, N. B., Ribeiro, S., Javitt, D. C., Copelli, M., & Corcoran, C. M. (2015). Automated analysis of free speech predicts psychosis onset in high-risk youths. Npj Schizophrenia, 1(1), Article 1. https://doi.org/10.1038/npjschz.2015.30
+
+[2] Parola, A., Lin, J. M., Simonsen, A., Bliksted, V., Zhou, Y., Wang, H., Inoue, L., Koelkebeck, K., & Fusaroli, R. (2022). Speech disturbances in schizophrenia: Assessing cross-linguistic generalizability of NLP automated measures of coherence. Schizophrenia Research. https://doi.org/10.1016/j.schres.2022.07.002
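To make the first-/second-order definitions in this file concrete, here is a minimal sketch computed directly from spaCy sentence vectors. It mirrors the description above rather than the component's exact implementation, and assumes a model with word vectors (e.g. `en_core_web_md`):

```python
# Sketch: first- and second-order coherence from sentence vectors.
import numpy as np
import spacy

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(a @ b) / denom if denom else 0.0

nlp = spacy.load("en_core_web_md")  # word vectors are needed for sent.vector
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth.")
vectors = [sent.vector for sent in doc.sents]

# First-order: consecutive sentences; second-order: two sentences apart.
first_order = [cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
second_order = [cosine(vectors[i], vectors[i + 2]) for i in range(len(vectors) - 2)]
print(np.mean(first_order), np.mean(second_order))
```

The two means printed here correspond to the values the component stores in the `._.coherence` dict.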
6 changes: 3 additions & 3 deletions docs/conf.py
@@ -59,11 +59,11 @@
# -- Options for myst-nb -----------------------------------------------------

# set the timeout for executing notebooks
-# nb_execution_timeout = 600 # in seconds, default 30 seconds
-nb_execution_timeout = 600 # in seconds, default 30 seconds. 600 seconds = 10 minutes
+# nb_execution_timeout = 600 # in seconds, default 30 seconds. 600 seconds = 10 minutes

# always fail CI pipeline when nb cannot be executed
-nb_execution_raise_on_error = True
+# nb_execution_raise_on_error = True
+nb_execution_mode = "off" # for now we will not execute the notebooks during doc build


# -- Options for HTML output -------------------------------------------------
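For reference, should notebook execution be re-enabled later, the same myst-nb options would plausibly be set as in this sketch (the `nb_execution_*` option names are myst-nb's own; the values shown are assumptions, not part of this commit):

```python
# Sketch only: re-enabling notebook execution in docs/conf.py (assumption).
nb_execution_mode = "force"         # execute all notebooks at build time
nb_execution_timeout = 600          # execution timeout in seconds
nb_execution_raise_on_error = True  # fail the docs build (and CI) on any error
```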
61 changes: 0 additions & 61 deletions docs/faq.rst

This file was deleted.

1 change: 0 additions & 1 deletion docs/index.rst
@@ -6,7 +6,6 @@ TextDescriptives

TextDescriptives is a Python library for calculating a large variety of statistics from text(s) using spaCy v.3 pipeline components and extensions.
TextDescriptives can be used to calculate several descriptive statistics, readability metrics, and metrics related to dependency distance.
-If you wish to try out the package, you can use the `online demo <https://huggingface.co/spaces/HLasse/textdescriptives>`__.

The documentation is organized in two parts:

7 changes: 4 additions & 3 deletions docs/information_theory.rst
@@ -4,11 +4,12 @@ Information Theory
The `information_theory` component calculates information theoretic measures derived
from the text. These include:

-- `{doc/span}._.entropy`: the Shannon entropy of the text using the `token.prob` as the probability
+
+- :code:`{doc/span}._.entropy`: the Shannon entropy of the text using `token.prob` as the probability
  of each token. Entropy is defined as :math:`H(X) = -\sum_{i=1}^n p(x_i) \log_e p(x_i)`, where :math:`p(x_i)` is the probability of the token :math:`x_i`.
-- `{doc/span}._.perplexity`: the perplexity of the text. Perplexity is a measurement of how well a
+- :code:`{doc/span}._.perplexity`: the perplexity of the text. Perplexity is a measurement of how well a
  probability distribution or probability model predicts a sample. Perplexity is defined as :math:`PPL(X) = e^{H(X)}`, where :math:`H(X)` is the entropy of the text.
-- `{doc/span}._.per_word_perplexity`: The perplexity of the text, divided by the number of words. Can be considered the length-normalized perplexity.
+- :code:`{doc/span}._.per_word_perplexity`: The perplexity of the text, divided by the number of words. Can be considered the length-normalized perplexity.

These information-theoretic measures are often used to describe the complexity of a text.
The higher the entropy, the more complex the text is.
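As a worked example of the definitions above, a small sketch (illustrative only: spaCy's `token.prob` is a log-probability, and models without a loaded probability table return a default value):

```python
# Sketch: entropy, perplexity, and per-word perplexity from token
# log-probabilities, following H(X) = -sum p(x_i) log p(x_i) and
# PPL(X) = e^{H(X)} as defined above.
import numpy as np

log_probs = np.array([-3.2, -5.1, -2.8, -7.4])  # stand-ins for token.prob values

probs = np.exp(log_probs)
entropy = float(-np.sum(probs * log_probs))
perplexity = float(np.exp(entropy))
per_word_perplexity = perplexity / len(log_probs)
print(entropy, perplexity, per_word_perplexity)
```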
19 changes: 0 additions & 19 deletions docs/news.rst

This file was deleted.

16 changes: 8 additions & 8 deletions docs/readability.rst
@@ -9,49 +9,49 @@ The *readability* component adds the following readability metrics under the :co



-* **`Gunning-Fog <https://en.wikipedia.org/wiki/Gunning_fog_index>`__**, is a readability index originally developed for English writing, but works for any language. The index estimates the years of formal education needed to understand the text on a first reading. A Gunning-Fog index of 12 requires the reading level of a U.S. high school senior (around 18 years old). The formula for calculating the index is:
+* `Gunning-Fog <https://en.wikipedia.org/wiki/Gunning_fog_index>`__ is a readability index originally developed for English writing, but works for any language. The index estimates the years of formal education needed to understand the text on a first reading. A Gunning-Fog index of 12 requires the reading level of a U.S. high school senior (around 18 years old). The formula for calculating the index is:

*Grade level = 0.4 × (ASL + PHW)*

Where *ASL* is the average sentence length (total words / total sentences), and *PHW* is the percentage of hard words (words with three or more syllables).

Note: requires hyphenation.

-* **`SMOG <https://en.wikipedia.org/wiki/SMOG>`__**, or Simple Measure of Gobbledygook, is a readability formula that estimates the years of education required to understand a piece of writing. It primarily focuses on the complexity of words, using the number of polysyllabic words in the text. The formula is:
+* `SMOG <https://en.wikipedia.org/wiki/SMOG>`__, or Simple Measure of Gobbledygook, is a readability formula that estimates the years of education required to understand a piece of writing. It primarily focuses on the complexity of words, using the number of polysyllabic words in the text. The formula is:

*SMOG Index = 1.043 × √(30 × (hard words / n_sentences)) + 3.1291*

Note: requires hyphenation.

-* **`Flesch reading ease <https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch_reading_ease>`__**, is a readability score that indicates how easy a text is to read. Higher scores indicate easier reading, while lower scores indicate more difficult reading. The score is calculated using the following formula:
+* `Flesch reading ease <https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch_reading_ease>`__ is a readability score that indicates how easy a text is to read. Higher scores indicate easier reading, while lower scores indicate more difficult reading. The score is calculated using the following formula:

*Flesch Reading Ease = 206.835 - (1.015 × ASL) - (84.6 × ASW)*

Where *ASL* is the average sentence length and *ASW* is the average number of syllables per word.

Note: requires hyphenation.

-* **`Flesch-Kincaid grade <https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch%E2%80%93Kincaid_grade_level>`__**, is a readability metric that estimates the grade level needed to comprehend a text. It is based on the average sentence length and average number of syllables per word. The formula is:
+* `Flesch-Kincaid grade <https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch%E2%80%93Kincaid_grade_level>`__ is a readability metric that estimates the grade level needed to comprehend a text. It is based on the average sentence length and average number of syllables per word. The formula is:

*Flesch-Kincaid Grade = 0.39 × (ASL) + 11.8 × (ASW) - 15.59*

Note: requires hyphenation.

-* **`Automated readability index <https://en.wikipedia.org/wiki/Automated_readability_index>`__**, is a readability test that calculates an approximate U.S. grade level needed to understand a text. It is based on the average number of characters per word and the average sentence length. The formula is:
+* `Automated readability index <https://en.wikipedia.org/wiki/Automated_readability_index>`__ is a readability test that calculates an approximate U.S. grade level needed to understand a text. It is based on the average number of characters per word and the average sentence length. The formula is:

*ARI = 4.71 × (n_chars / n_words) + 0.5 × (n_words / n_sentences) - 21.43*

-* **`Coleman-Liau index <https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index>`___**, is a readability test that estimates the U.S. grade level needed to understand a text. It is based on the average number of letters per 100 words and the average number of sentences per 100 words. The original formula is:
+* `Coleman-Liau index <https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index>`__ is a readability test that estimates the U.S. grade level needed to understand a text. It is based on the average number of letters per 100 words and the average number of sentences per 100 words. The original formula is:

*CLI = 0.0588 × L - 0.296 × S - 15.8*

Where *L* is the average number of characters per 100 words and *S* is the average number of sentences per 100 words. In our implementation we average over the entire text instead of just 100 words.

-* **`Lix <https://en.wikipedia.org/wiki/Lix_(readability_test)>`__**, or Lesbarhetsindex, is a readability measure that calculates a readability score based on the average sentence length and the percentage of long words (more than six characters) in the text. The formula is:
+* `Lix <https://en.wikipedia.org/wiki/Lix_(readability_test)>`__, or Lesbarhetsindex, is a readability measure that calculates a readability score based on the average sentence length and the percentage of long words (more than six characters) in the text. The formula is:

*Lix = (n_words / n_sentences) + (n_long_words * 100) / n_words*

-* **`Rix <https://www.jstor.org/stable/40031755>`__**, is a readability measure that estimates the difficulty of a text based on the proportion of long words (more than six characters) in the text. The formula is:
+* `Rix <https://www.jstor.org/stable/40031755>`__ is a readability measure that estimates the difficulty of a text based on the proportion of long words (more than six characters) in the text. The formula is:

*Rix = (n_long_words / n_sentences)*
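Of the metrics above, Lix and Rix need no syllable counts, so they make a compact worked example. The sketch below uses a deliberately naive sentence/word splitter (the component itself counts via spaCy tokens), so treat it as an illustration of the two formulas only:

```python
# Sketch of the Lix and Rix formulas with a naive tokenizer.
import re

def lix_rix(text: str) -> tuple[float, float]:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\s.,!?;:]+", text)
    long_words = [w for w in words if len(w) > 6]  # "long" = more than six characters

    lix = len(words) / len(sentences) + (len(long_words) * 100) / len(words)
    rix = len(long_words) / len(sentences)
    return lix, rix

print(lix_rix("Readability formulas estimate comprehension difficulty. Short words help."))
```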
