This repository contains the Spanish ELTeC (European Literary Text Collection) corpus and related analysis. The project focuses on sentiment analysis of literary texts from Spain, specifically examining how different geographic locations are portrayed in Spanish literature across time periods.
The Spanish ELTeC subcorpus consists of 87 XML-encoded literary texts from different time periods (1840-1920). Each text includes rich metadata such as author information, publication year, and word count.
The folder contains two kinds of files:
- Original ELTeC XML files: contain the original Spanish ELTeC level 1 data
- Annotated files: contain the version of the corpus annotated with spaCy, including part-of-speech tags and named entities; these files end with the suffix "_annotated.xml"
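Since each text carries its metadata (author, title, publication year) in the TEI header, it can be read programmatically. The sketch below is illustrative only: ELTeC level 1 files are TEI XML, but the element paths and the function name here are assumptions, not code from this repository.

```python
# Hedged sketch: reading title/author metadata from an ELTeC level-1 file.
# ELTeC texts use the TEI namespace; paths below are illustrative.
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def read_metadata(source):
    """Extract title and author from a TEI header (path or file object)."""
    root = ET.parse(source).getroot()
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    author = root.find(".//tei:titleStmt/tei:author", TEI_NS)
    return {
        "title": title.text if title is not None else None,
        "author": author.text if author is not None else None,
    }
```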
The experiment files are the Jupyter notebooks "SentimentAnalysis.ipynb" and "All_instances.ipynb".
- The "SentimentAnalysis.ipynb" notebook was used to conduct experiments using only the 30 instances that obtained the highest confidence scores when analyzing the sentiment of the text window.
- The "All_instances.ipynb" notebook includes results from all instances that mention one of the locations of interest. This was done to analyze how many times foreign locations were mentioned across the corpus, without taking the confidence score of the sentiment classification into account.
Text Processing:
- The original ELTeC files were processed using spaCy to extract the text and metadata.
- Applied POS and NER tagging using spaCy.
- Enhanced with "mrm8488/bert-spanish-cased-finetuned-ner" for improved entity recognition.
- Filtered for geographic locations of interest: America, Cuba, Philippines, Egypt, and Asia.
- Resulting files: the annotated versions and the files within the "Name Entity Extraction" folder:
  - entities_frequency.csv: shows the results using spaCy.
  - filtered_entities.csv: shows the results using the transformer model.
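The location-filtering step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function name is invented and the Spanish surface forms of the locations are assumptions; in the pipeline, the (text, label) pairs would come from spaCy's `doc.ents`.

```python
# Hedged sketch of filtering NER output down to the locations of interest.
from collections import Counter

# Assumed Spanish surface forms of the README's target locations.
LOCATIONS = {"América", "Cuba", "Filipinas", "Egipto", "Asia"}

def filter_location_entities(entities):
    """Count LOC entities matching the locations of interest.

    `entities` is a list of (text, label) pairs. In the real pipeline:
        doc = nlp(text)
        entities = [(e.text, e.label_) for e in doc.ents]
    """
    return Counter(text for text, label in entities
                   if label == "LOC" and text in LOCATIONS)
```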
Sentiment Analysis:
- Used the "tabularisai/multilingual-sentiment-analysis" model.
- Extracted 100-token windows around each geographic mention.
- Analyzed sentiment with five categories, later collapsed to Positive/Negative.
- Retained the top 30 results by confidence score for analysis.
- Resulting files:
  - sentiment.csv: shows the results using the transformer model.
  - all_countries.csv: shows the results of the analysis using all instances in the corpus (not only those with the highest confidence scores).
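The windowing and label-collapsing steps can be sketched like this. The function names are illustrative, and the exact five label strings of the model (and the decision to drop neutral windows) are assumptions here, not confirmed from the notebook:

```python
def token_window(tokens, idx, size=100):
    """Return a window of up to `size` tokens centered on index `idx`."""
    half = size // 2
    start = max(0, idx - half)
    return tokens[start:start + size]

def collapse_label(label):
    """Collapse a five-class sentiment label to Positive/Negative.
    The label names below are assumed, not taken from the model card."""
    if label in ("Very Positive", "Positive"):
        return "Positive"
    if label in ("Very Negative", "Negative"):
        return "Negative"
    return None  # neutral windows are discarded in this sketch

# In the pipeline, the classifier itself would be loaded with:
#   from transformers import pipeline
#   clf = pipeline("text-classification",
#                  model="tabularisai/multilingual-sentiment-analysis")
```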
Visualization:
- Created 40 individual text files with semantically meaningful words.
- Applied TF-IDF filtering to remove common, uninformative terms.
- Generated word clouds for each location-sentiment-period combination.
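The TF-IDF filtering can be sketched with a plain tf·log(N/df) weighting; the exact weighting scheme used in the notebook is an assumption here, and the function name is invented:

```python
# Hedged sketch of TF-IDF filtering: terms appearing in every document get
# zero weight and drop out, removing common, uninformative vocabulary.
import math
from collections import Counter

def top_tfidf_terms(docs, doc_index, k=50):
    """Rank the terms of docs[doc_index] by tf * log(N/df) and keep the
    top k with positive weight."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))          # document frequency per term
    tf = Counter(tokenized[doc_index])  # term frequency in the target doc
    weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    ranked = sorted(weights, key=weights.get, reverse=True)
    return [t for t in ranked[:k] if weights[t] > 0]

# The surviving terms could then feed the word-cloud step, e.g.:
#   from wordcloud import WordCloud
#   wc = WordCloud().generate(" ".join(top_tfidf_terms(docs, i)))
```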
The "results/" directory contains:
- Processed text windows (.txt files)
- Word cloud visualizations in the "WordCloud" folder

The "Name Entity Extraction/" directory contains:
- entities_frequency.csv: shows the results using spaCy.
- filtered_entities.csv: shows the results using the transformer model.

The "data/" directory contains:
- sentiment.csv: shows the results using the transformer model.
- all_countries.csv: shows the results of the analysis using all instances in the corpus.
To reproduce the analysis:
- Navigate to the "SentimentAnalysis.ipynb" notebook.
- Follow the instructions in the notebook to process the data and generate results.
Requirements:
- Python 3.x
- Jupyter Notebook
- Required Python packages (see requirements.txt):
  - spaCy
  - Transformers
  - WordCloud
  - NLTK
  - Pandas
  - Matplotlib