Vidal data mining is a project based on data scraping and mining using Python and Unitex. The idea of the project is to extract drug (medication) information from the VIDAL website,
then match each drug with every prescription of it found in a large medical corpus file, `corpus-medical.txt`, which contains a history of doctors' visit reports.
Notes:
- The Vidal website contains information about medications and parapharmacy products.
- The project targets the French language.
Vidal website:
Content of corpus-medical.txt:
- Execute the Python script `scrapper.py` to extract drug substances from the VIDAL website: `scrapper.py letter1-letter2`
  - The `letter1-letter2` argument represents the range of letters to scrape, for example `A-Z`.
  - The script generates two files, `subst.dic` and `info.txt`:
    - `subst.dic` contains all extracted substances, each with the Unitex dictionary suffix added to match the Unitex `.dic` dictionary format.
    - `info.txt` contains extraction statistics: the number of substances for each letter and the total number of extracted substances.
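As a rough illustration of this step, the sketch below shows how a `letter1-letter2` argument could be expanded into a list of letters, and how a substance name could be turned into a Unitex dictionary line. It is not the code of `scrapper.py`: the function names are hypothetical, and the `,.N+subst` suffix is an assumption based on the usual DELAF layout (`form,lemma.CODE+feature`).

```python
import string


def parse_letter_range(arg):
    """Expand a range such as 'A-Z' into the list of letters it covers."""
    start, sep, end = arg.upper().partition("-")
    if not sep or len(start) != 1 or len(end) != 1:
        raise ValueError(f"expected letter1-letter2, got {arg!r}")
    letters = string.ascii_uppercase
    if start not in letters or end not in letters or start > end:
        raise ValueError(f"invalid letter range: {arg!r}")
    return list(letters[letters.index(start):letters.index(end) + 1])


def to_dic_entry(substance, suffix=",.N+subst"):
    """Format one substance as a dictionary line (the suffix is an assumption)."""
    return substance.strip().lower() + suffix


if __name__ == "__main__":
    print(parse_letter_range("A-C"))
    print(to_dic_entry("Paracétamol"))
```

Validating the range up front means a malformed argument fails before any page is requested.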
- Execute the Python script `enrch.py` to enrich the collected substance dictionary. The script extracts new substances from the file `corpus-medical.txt` and adds them to a new dictionary, `subst_enri.dic`. It also removes duplicate entries and sorts the substances in both `subst.dic` and `subst_enri.dic`.
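The deduplicate-and-sort part of this step can be sketched as follows. This is a minimal illustration, not the logic of `enrch.py`: the function name `merge_entries` is hypothetical, and treating entries that differ only by letter case as duplicates is an assumption.

```python
def merge_entries(existing, new_entries):
    """Return sorted, duplicate-free dictionary lines from both sources."""
    seen = set()
    merged = []
    for line in list(existing) + list(new_entries):
        entry = line.strip()
        key = entry.lower()  # assumption: case-insensitive duplicates
        if entry and key not in seen:
            seen.add(key)
            merged.append(entry)
    # Plain lowercase ordering; accented letters are not collated the French way.
    return sorted(merged, key=str.lower)


if __name__ == "__main__":
    old = ["ibuprofène,.N+subst", "aspirine,.N+subst"]
    new = ["Aspirine,.N+subst", "codéine,.N+subst"]
    for entry in merge_entries(old, new):
        print(entry)
```

The first spelling encountered is the one kept, so entries already in the dictionary take precedence over those found in the corpus.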
- Open Unitex and select French as the language.
- Move the files `subst_enri.dic` and `subst.dic` to the Unitex DELA folder, located in the user's documents folder.
  - Example of my path: `D:\Users\Asus\Documents\Unitex-GramLab\Unitex\French\Dela`
- Apply preprocessing & lexical parsing to `corpus-medical.txt`.
- Open `subst_enri.dic` in DELA and compress the dictionary into FST. Two files, `subst_enri.bin` and `subst_enri.inf`, should be generated in the DELA folder.
- Apply the same steps to `subst.dic`.
- Open `projetpy.grf` in FSGraph to visualize the extraction graphs. `projetpy.grf` is the main graph; it consists of 3 subgraphs (3 possible matches):
  - `projetpy1.grf`
  - `projetpy2.grf`
  - `projetpy3.grf`
- Notes:
  - `<n+subst>` matches a dictionary word. In our case, it is the drug name scraped earlier (`subst_enri.dic` and `subst.dic`).
  - `<MOT>` matches a word, roughly like `\w+` in regular expressions.
  - `<NB>` matches a number, roughly like `\d+` in regular expressions.
  - For more information on the graph syntax, please refer to the Unitex documentation.
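To make the regular-expression analogy concrete, the sketch below approximates the three masks with Python's `re` module. It does not reproduce the project's graphs: the sequence "drug name, number, word" and the sample substance list are hypothetical, chosen only to show how each mask maps to a regex fragment.

```python
import re

# Hypothetical stand-in for the entries of subst.dic / subst_enri.dic.
SUBSTANCES = ["paracétamol", "ibuprofène"]

# <n+subst> -> alternation over the dictionary entries
# <NB>      -> \d+        (a run of digits)
# <MOT>     -> [^\W\d_]+  (a run of letters)
PATTERN = re.compile(
    r"\b(?P<subst>" + "|".join(map(re.escape, SUBSTANCES)) + r")\s+"
    r"(?P<nb>\d+)\s+(?P<mot>[^\W\d_]+)",
    re.IGNORECASE,
)


def find_prescriptions(text):
    """Return (substance, number, word) tuples for every match in text."""
    return [(m["subst"], m["nb"], m["mot"]) for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = "Traitement : Paracétamol 500 mg le matin, ibuprofène 200 mg."
    for hit in find_prescriptions(sample):
        print(hit)
```

Unlike this regex, Unitex matches against the tokenized text and the compiled dictionaries, so the dictionary lookup there does not require listing the entries inside the pattern.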
- Apply lexical resources to the previously preprocessed text.
  - Select `subst_enri.bin` and `subst.bin` in user resources and `dela.fr` in system resources.
- The final step consists of locating patterns and building concordances:
  - Choose Locate Pattern.
  - Select the `projetpy.grf` graph.
  - Select `all_matches` and merge with output text.
  - Index all occurrences in the text.
  - Build the concordance to visualize the results.

The results are stored in the `corpus-medical_snt\concord.html` file, located in the same folder as `corpus-medical.txt`. Use a web browser for better formatting.
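If the matches need to be post-processed outside the browser, something like the following sketch could pull them out of the concordance file. It rests on an assumption that should be checked against an actual `concord.html`: that each matched sequence is wrapped in an `<a>` element. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class MatchCollector(HTMLParser):
    """Collect the text found inside <a> elements."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self.matches.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_link:
            self.matches[-1] += data


def extract_matches(html_text):
    """Return the stripped text of every <a> element in html_text."""
    parser = MatchCollector()
    parser.feed(html_text)
    parser.close()
    return [m.strip() for m in parser.matches]


if __name__ == "__main__":
    row = '<tr><td>prendre <a href="1 2 3">paracétamol 500 mg</a> le matin</td></tr>'
    print(extract_matches(row))
```

To run it on the real file, read the file into a string first and pass it to `extract_matches`; the encoding to use when reading depends on the Unitex settings.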





