Description
At the moment, posts are tagged by applying various regexes to the main body of the message. This approach is limited to the content of the message, as opposed to what is actually present in the attached document; furthermore, some of the regexes are broken and need fixing.
To make the tagging system more useful, it will be necessary to:
- Download the PDFs to a temporary folder
- Extract the text from each PDF (text extractor or OCR)
- Write finer regex rules
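A minimal sketch of how these steps could fit together, using only the Python standard library for the download and the temporary folder. The names `tag_document`, `extract_text` and `tag_text` are hypothetical and not taken from the codebase; `extract_text` stands in for whichever text extractor or OCR wrapper ends up being chosen, and `tag_text` for the existing regex tagger.

```python
import tempfile
import urllib.request
from pathlib import Path
from typing import Callable, List


def tag_document(
    url: str,
    extract_text: Callable[[Path], str],
    tag_text: Callable[[str], List[str]],
) -> List[str]:
    """Download a PDF to a temporary folder, extract its text, and tag it.

    extract_text and tag_text are hypothetical callables supplied by the caller.
    """
    # The directory and everything in it are deleted when the block exits.
    with tempfile.TemporaryDirectory() as tmp_dir:
        pdf_path = Path(tmp_dir) / "attachment.pdf"
        with urllib.request.urlopen(url, timeout=30) as response:
            pdf_path.write_bytes(response.read())
        # Extraction must happen while the temporary file still exists.
        text = extract_text(pdf_path)
    return tag_text(text)
```

Keeping extraction and tagging as injected callables would let the OCR fallback be added later without touching the download logic.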
Some examples of misclassifications caused by broken regexes:
...4 “Istruzione e Ricerca” - Componente 2 “Dalla ricerca all’impresa”- progetti... [classified as DEI because of "impresa"]
...15/02/2023 - Il dott. Gioacchino Alex ANASTASI, nato nel 1991.... [classified as GIUR because of "lex"]
...in Automation Engineering and Control of Complex Systems (LM-25).... [classified as GIUR because of "lex"]
In general, the DEI and GIUR department rules are somewhat broken because they match terms in common use.
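A small sketch of what finer rules might look like, reproducing the three misclassifications above. The actual patterns used by the project are not shown in this issue, so both rule sets here are assumptions: `NAIVE_RULES` mimics the substring matching that seems to cause the problem, while `STRICT_RULES` adds word boundaries for GIUR and, for DEI, a hypothetical multi-word phrase, since a word boundary alone would still match "impresa" as a standalone word.

```python
import re

# Hypothetical rules: bare substrings, as the misclassifications suggest.
NAIVE_RULES = {
    "GIUR": re.compile(r"lex", re.IGNORECASE),
    "DEI": re.compile(r"impresa", re.IGNORECASE),
}

# Hypothetical stricter rules: \b keeps "lex" from matching inside
# "Alex" or "Complex"; DEI requires a longer, assumed phrase.
STRICT_RULES = {
    "GIUR": re.compile(r"\blex\b", re.IGNORECASE),
    "DEI": re.compile(r"\beconomia\s+e\s+impresa\b", re.IGNORECASE),
}

SAMPLES = [
    "Componente 2 “Dalla ricerca all’impresa”- progetti",
    "Il dott. Gioacchino Alex ANASTASI, nato nel 1991",
    "in Automation Engineering and Control of Complex Systems (LM-25)",
]


def tag(text, rules):
    """Return the sorted names of all rules whose pattern occurs in text."""
    return sorted(name for name, pattern in rules.items() if pattern.search(text))


for sample in SAMPLES:
    print(tag(sample, NAIVE_RULES), tag(sample, STRICT_RULES))
```

With the naive rules each sample picks up a department tag; with the stricter ones none of them do.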