[Filter] Duplicate Document Filter

A document consists of paragraphs, and each paragraph has two counts: the number of exact copies of that paragraph in the corpus, and the number of similar paragraphs in the corpus.

We want to drop heavily duplicated paragraphs individually, and drop whole documents probabilistically if all (or nearly all) of their paragraphs are duplicates.
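
The behaviour described above could be sketched roughly as follows. This is only an illustration under assumptions, not the actual filter: the names `Paragraph`, `is_duplicate`, and `filter_document`, the thresholds `max_exact`, `max_similar`, and `doc_threshold`, and the choice to use the duplicate fraction as the drop probability are all hypothetical. It also assumes that `exact_count` includes the paragraph itself, so a count of 1 means the paragraph is unique.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Paragraph:
    text: str
    exact_count: int    # exact copies in the corpus (assumed to include this one)
    similar_count: int  # similar (near-duplicate) paragraphs in the corpus


def is_duplicate(p: Paragraph, max_exact: int = 1, max_similar: int = 10) -> bool:
    """Flag a paragraph whose counts exceed the (hypothetical) thresholds."""
    return p.exact_count > max_exact or p.similar_count > max_similar


def filter_document(
    paragraphs: List[Paragraph],
    rng: random.Random,
    max_exact: int = 1,
    max_similar: int = 10,
    doc_threshold: float = 0.8,
) -> Optional[List[Paragraph]]:
    """Return the surviving paragraphs, or None if the whole document is dropped."""
    if not paragraphs:
        return None

    flags = [is_duplicate(p, max_exact, max_similar) for p in paragraphs]
    dup_fraction = sum(flags) / len(paragraphs)

    # Document-level drop: only considered when (nearly) all paragraphs are
    # duplicates, and then applied with probability equal to the duplicate
    # fraction (an assumed policy; the note does not specify one).
    if dup_fraction >= doc_threshold and rng.random() < dup_fraction:
        return None

    # Paragraph-level drop: remove each duplicate paragraph individually.
    return [p for p, dup in zip(paragraphs, flags) if not dup]
```

Passing in a seeded `random.Random` keeps the probabilistic document drop reproducible across runs. With these defaults, a document made up entirely of duplicate paragraphs is always dropped, while a document with only a few duplicates just loses those paragraphs.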