Better department-tagging system #13

@LukeTheWalker

Description

At the moment, posts are tagged by applying various regexes to the main body of the message. This approach is limited to the content of the message, as opposed to what is actually present in the document; furthermore, some of the regexes are broken and need fixing.

To make the tagging system more useful, it will be necessary to:

  • Download the PDFs to a temporary folder
  • Extract the text from each PDF (text extractor or OCR)
  • Write finer regex rules
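
The first two steps could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names (`download_pdf`, `extract_text`, `text_for_tagging`) are hypothetical and not taken from the bot's code, and text extraction assumes the third-party `pypdf` package is installed. Scanned PDFs with no embedded text would yield an empty string here and would still need an OCR step, which is not shown.

```python
import tempfile
import urllib.request
from pathlib import Path
from typing import Callable, Iterable


def download_pdf(url: str, dest: Path) -> Path:
    """Download `url` to the local file `dest` and return `dest`."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        dest.write_bytes(resp.read())
    return dest


def extract_text(pdf_path: Path) -> str:
    """Return the embedded text of a PDF (assumes `pip install pypdf`)."""
    from pypdf import PdfReader  # third-party, imported lazily

    reader = PdfReader(str(pdf_path))
    # extract_text() can return an empty string for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def text_for_tagging(
    body: str,
    pdf_urls: Iterable[str],
    download: Callable[[str, Path], Path] = download_pdf,
    extract: Callable[[Path], str] = extract_text,
) -> str:
    """Combine the message body with the text of its PDF attachments."""
    parts = [body]
    # The temporary folder and its files are deleted when the block exits.
    with tempfile.TemporaryDirectory() as tmp:
        for i, url in enumerate(pdf_urls):
            dest = Path(tmp) / f"attachment_{i}.pdf"
            try:
                parts.append(extract(download(url, dest)))
            except Exception:
                # Hypothetical policy: skip attachments that fail and
                # fall back to whatever text is already available.
                continue
    return "\n".join(parts)
```

The combined string can then be passed to the regex rules in place of the message body alone. Passing `download` and `extract` as parameters keeps the pipeline testable without network access and leaves room to swap in an OCR-based extractor later.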

Some examples of misclassifications caused by broken regexes:
...4 “Istruzione e Ricerca” - Componente 2 “Dalla ricerca all’impresa”- progetti... [classified as DEI because of "impresa"]
...15/02/2023 - Il dott. Gioacchino Alex ANASTASI, nato nel 1991.... [classified as GIUR because of "lex"]
...in Automation Engineering and Control of Complex Systems (LM-25).... [classified as GIUR because of "lex"]

In general, the DEI and GIUR departments are the most affected, because their patterns contain terms in common use.
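
A minimal sketch of what finer rules might look like, using the three excerpts above as test cases. The patterns are hypothetical, since the bot's real regexes are not shown in this issue: `NAIVE_RULES` only mimics the reported behaviour (substring matches on "impresa" and "lex"), and `STRICTER_RULES` assumes that DEI can be recognised by the full phrase "Economia e Impresa" and GIUR by "Giurisprudenza" or "lex" as a standalone word.

```python
import re

# Hypothetical patterns that reproduce the reported false positives.
NAIVE_RULES = {
    "DEI": re.compile(r"impresa", re.IGNORECASE),
    "GIUR": re.compile(r"lex", re.IGNORECASE),
}

# Hypothetical stricter patterns: \b stops "lex" from matching inside
# "Alex" or "Complex"; a multi-word phrase is used for DEI because
# \bimpresa\b alone would still match "all’impresa".
STRICTER_RULES = {
    "DEI": re.compile(r"\beconomia\s+e\s+impresa\b", re.IGNORECASE),
    "GIUR": re.compile(r"\b(?:giurisprudenza|lex)\b", re.IGNORECASE),
}


def tag(text: str, rules: dict[str, re.Pattern[str]]) -> list[str]:
    """Return the sorted department tags whose pattern matches `text`."""
    return sorted(dept for dept, pattern in rules.items() if pattern.search(text))


SAMPLES = [
    "4 “Istruzione e Ricerca” - Componente 2 “Dalla ricerca all’impresa”- progetti",
    "15/02/2023 - Il dott. Gioacchino Alex ANASTASI, nato nel 1991",
    "in Automation Engineering and Control of Complex Systems (LM-25)",
]

for sample in SAMPLES:
    print(tag(sample, NAIVE_RULES), tag(sample, STRICTER_RULES))
```

With these assumed patterns, the naive rules tag the three excerpts as DEI, GIUR and GIUR, while the stricter rules tag none of them. Word boundaries fix the "lex" case, but a common word such as "impresa" needs surrounding context to be a reliable signal.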


Labels: bug, enhancement
