Wordcloud diagram #10

@BastianMoya

Description

I encountered an issue when trying to plot multiple sets in a single Venn/Euler diagram using Diagram.as_wordcloud. The problem occurs when one or more regions of the diagram contain no words. In these cases, the plot fails.

For example, the following setup fails because some regions of the diagram are empty:

from matplotlib_set_diagrams import VennDiagram

text_v = """word1 word2 non"""
text_f = """word3 word4 non"""
text_d = """word5 word3 non word1 word1 word4 word6"""

def word_tokenize(text):
    """Break a string into its constituent words and convert them into tokens."""
    words = text.split(' ')
    words = [''.join(ch for ch in word if ch.isalnum()) for word in words]
    return words

sets = [set(word_tokenize(text)) for text in [text_v, text_f, text_d]]

VennDiagram.as_wordcloud(sets)
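
To make it concrete which regions end up empty, here is a small standalone sketch that does not use the library at all. The helper name exclusive_regions is my own, and the sets are written out by hand to match the tokenized texts above. Each region is identified by a tuple of booleans saying which of the three sets it belongs to.

```python
from itertools import product


def exclusive_regions(sets):
    """Map each membership pattern to the items found in exactly those sets."""
    regions = {}
    for pattern in product([False, True], repeat=len(sets)):
        if not any(pattern):
            continue  # skip the area outside all sets
        included = [s for s, keep in zip(sets, pattern) if keep]
        excluded = [s for s, keep in zip(sets, pattern) if not keep]
        regions[pattern] = set.intersection(*included).difference(*excluded)
    return regions


# Same content as the tokenized text_v, text_f and text_d above.
sets = [
    {"word1", "word2", "non"},
    {"word3", "word4", "non"},
    {"word1", "word3", "word4", "word5", "word6", "non"},
]

empty = [pattern for pattern, words in exclusive_regions(sets).items() if not words]
print(empty)  # [(False, True, False), (True, True, False)]
```

So two regions have no words: the part of the second set that overlaps nothing else, and the overlap of the first and second sets outside the third.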

Proposed solution

I believe I’ve found a workaround that resolves the issue with empty regions. I’m sharing it here in case it’s useful and could potentially be included or improved by the developers.

  1. Handling empty wordcloud regions (wordcloud.py)
    I modified the following code:
if len(frequencies) <= 0:
    raise ValueError(
        "We need at least 1 word to plot a word cloud, got %d."
        % len(frequencies)
    )

to:

frequencies = frequencies[:self.max_words]

# largest entry will be 1
if len(frequencies) <= 0:
    # Empty region: fall back to a single blank placeholder instead of raising
    frequencies = [(" ", 1)]
else:
    max_frequency = float(frequencies[0][1])
    frequencies = [(word, freq / max_frequency)
                   for word, freq in frequencies]

This change prevents errors in empty regions, allowing the diagram to be plotted successfully.
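
The same logic can be tried in isolation. This is only a sketch: normalize_frequencies is a hypothetical helper name, max_words stands in for self.max_words, and the input is assumed to be a list of (word, count) pairs already sorted by descending count.

```python
def normalize_frequencies(frequencies, max_words=200):
    """Scale (word, count) pairs so that the largest count becomes 1.0.

    An empty input yields a single blank placeholder entry instead of
    raising, which mirrors the workaround described above.
    """
    frequencies = frequencies[:max_words]
    if not frequencies:
        return [(" ", 1)]
    max_frequency = float(frequencies[0][1])
    return [(word, freq / max_frequency) for word, freq in frequencies]


print(normalize_frequencies([("word1", 4), ("word3", 2)]))  # [('word1', 1.0), ('word3', 0.5)]
print(normalize_frequencies([]))                            # [(' ', 1)]
```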

I also encountered an issue with the to_array method:

def to_array(self, copy=None):
    return np.asarray(self.to_image(), copy=copy)

I resolved it by changing the last line to:

return np.asarray(self.to_image())
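
One possible explanation, which I have not confirmed for this case: np.asarray only accepts the copy keyword from NumPy 2.0 onwards, so on an older NumPy the original line raises a TypeError. If that is the cause, a version-tolerant variant could look like the sketch below. The function name image_to_array is hypothetical and it takes the image as an argument instead of calling self.to_image().

```python
import numpy as np


def image_to_array(image, copy=None):
    """Convert an image-like object to an array on old and new NumPy alike."""
    try:
        # NumPy >= 2.0 understands the copy keyword directly.
        return np.asarray(image, copy=copy)
    except TypeError:
        # Assumed cause: NumPy < 2.0 rejects the copy keyword.
        # copy=False ("never copy") cannot be enforced here, so it behaves like None.
        if copy:
            return np.array(image, copy=True)
        return np.asarray(image)
```

Dropping the keyword entirely, as in my change above, also works but silently ignores a copy request from the caller.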

  2. Supporting MultiPolygon geometries (_diagram_classes.py)

Finally, I modified the mask creation logic to support both Polygon and MultiPolygon geometries.

Original code:

path = Path(geometry.exterior.coords)
mask = path.contains_points(XY).reshape((height_in_pixel, width_in_pixel))
mask = np.flipud(mask)
subset_masks[subset_id] = mask

Updated version:

if isinstance(geometry, ShapelyPolygon):
    path = Path(geometry.exterior.coords)
    mask = path.contains_points(XY).reshape((height_in_pixel, width_in_pixel))
    mask = np.flipud(mask)
    subset_masks[subset_id] = mask

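elif isinstance(geometry, ShapelyMultiPolygon):
    # Union of the masks of all parts of the multi-part geometry
    mask = np.zeros(XY.shape[0], dtype=bool)
    for geom in geometry.geoms:
        path = Path(geom.exterior.coords)
        mask |= path.contains_points(XY)
    mask = mask.reshape((height_in_pixel, width_in_pixel))
    # Flip vertically so the orientation matches the Polygon branch
    mask = np.flipud(mask)
    subset_masks[subset_id] = mask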
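
The mask union can be checked without the library or shapely. In the sketch below each ring is a plain list of (x, y) tuples standing in for geom.exterior.coords. The function name rasterize_rings, the extent argument and the way the XY grid is built are my own assumptions, not the library's code.

```python
import numpy as np
from matplotlib.path import Path


def rasterize_rings(rings, width_in_pixel, height_in_pixel, extent):
    """Return a boolean image mask covering the union of several closed rings."""
    xmin, xmax, ymin, ymax = extent
    x = np.linspace(xmin, xmax, width_in_pixel)
    y = np.linspace(ymin, ymax, height_in_pixel)
    X, Y = np.meshgrid(x, y)  # both have shape (height, width)
    XY = np.column_stack([X.ravel(), Y.ravel()])

    mask = np.zeros(XY.shape[0], dtype=bool)
    for ring in rings:
        mask |= Path(ring).contains_points(XY)
    mask = mask.reshape((height_in_pixel, width_in_pixel))
    # Row 0 currently holds ymin; flip so that row 0 is the top of the image.
    return np.flipud(mask)


# Two disjoint squares, as the parts of a MultiPolygon might be.
rings = [
    [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)],
    [(3, 3), (4, 3), (4, 4), (3, 4), (3, 3)],
]
mask = rasterize_rings(rings, 9, 9, (0, 4, 0, 4))
print(mask.shape)  # (9, 9)
```

Both squares end up in the same mask, with the lower-left square near the bottom rows of the array and the upper-right square near the top rows.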

Hopefully this helps someone else facing the same issue.
Happy to open a PR or adjust the solution if needed.
