Migration-TR: Turkish migration discourse dataset with 6M tweets (2011-2022), AI models for perception-attitude classification, and bot detection tools.

cssturkiye/migration-tr

Migration-TR: Turkish Migration Discourse Dataset 🐦🇹🇷

Migration Discourses on X.com: Analysis of Public Perceptions and Attitudes Toward Refugees in Turkey Using Natural Language Processing

Evrim Yılmaz Polat* and Evrim Çağın Polat
Department of Sociology, Zonguldak Bülent Ecevit University, Zonguldak, Turkey;
Notrino Research, ODTÜ Teknokent, Ankara, Turkey

Computational Social Sciences Turkey (CSSTR) - Computational Social Sciences Working Group


🎯 Overview

This repository contains the Migration-TR dataset and the accompanying AI models for analyzing migration discourse on Turkish social media. Our research analyzes roughly 6 million tweets (5,884,624 before filtering) posted between 2011 and 2022 and collected through the Twitter Academic API, focusing on public perceptions of and attitudes toward migrants and refugees in Turkey.

Key Highlights

| Component | Details |
| --- | --- |
| Dataset | 5,884,624 raw tweets → 3,814,679 regular (human-authored) tweets |
| Time Span | 12-year temporal analysis (2011-2022) |
| Classification | 8-class granular model (F1: 0.768) |
| Bot Detection | XGBoost model (F1: 0.832) for data enrichment |
| Visualizations | Interactive time-series charts via Plotly |

🤖 AI Models

We provide two perception-attitude classification models based on Intergroup Threat Theory:

| Model | Schema | Classes | Macro F1 | Use Case |
| --- | --- | --- | --- | --- |
| Granular | all-classes | 8 | 0.768 | ✅ Recommended for applications |
| Super-Class | super-classes | 6 | 0.801 | 🧪 Experimental |
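
For readers unfamiliar with the metric in the table above, macro F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The sketch below shows the arithmetic only; the counts are made up for illustration and are not results from this project, and the helper names are hypothetical.

```python
from typing import Dict, Tuple


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for one class from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_score(*c) for c in counts.values()) / len(counts)


# Made-up (tp, fp, fn) counts for three classes, purely illustrative.
toy_counts = {
    "Sympathy": (8, 2, 2),
    "Neutral": (5, 5, 0),
    "Antipathy: Other": (1, 1, 3),
}

print(f"macro F1 = {macro_f1(toy_counts):.3f}")
```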

Architecture

  • Base Model: VRLLab/TurkishBERTweet (pre-trained on 894M tweets)
  • Fine-tuning: LoRA adapters
  • Training Data: 15,000 manually annotated tweets

Classification Labels

Both models classify tweets into perception-attitude categories based on Intergroup Threat Theory:

| Class Label | Description | Theoretical Basis |
| --- | --- | --- |
| Sympathy | Positive attitudes toward migrants | Counter-frame |
| Neutral | Neutral/informational content | No threat frame |
| Antipathy: Economic Threat | Fiscal burden, opposition to aid | Realistic threat |
| Antipathy: Employment Threat | Jobs/wages competition | Realistic threat |
| Antipathy: Security Threat | Crime, violence, border concerns | Realistic threat |
| Antipathy: Identity Threat | Cultural imposition, demographic change | Symbolic threat |
| Antipathy: Political Threat | Naturalization/voting as threat | Realistic threat |
| Antipathy: Other | Generalized hostility | Generalized threat |
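
When aggregating predictions, it can help to group the eight labels by their theoretical basis. The sketch below copies the mapping from the table above. It assumes the model emits label strings spelled exactly as in the table, which the README does not confirm. This grouping is also not the 6-class super-class schema, whose definition is not given here. `count_by_basis` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable

# Label -> theoretical basis, copied from the classification label table.
THEORETICAL_BASIS = {
    "Sympathy": "Counter-frame",
    "Neutral": "No threat frame",
    "Antipathy: Economic Threat": "Realistic threat",
    "Antipathy: Employment Threat": "Realistic threat",
    "Antipathy: Security Threat": "Realistic threat",
    "Antipathy: Identity Threat": "Symbolic threat",
    "Antipathy: Political Threat": "Realistic threat",
    "Antipathy: Other": "Generalized threat",
}


def count_by_basis(labels: Iterable[str]) -> Counter:
    """Count predicted labels per theoretical basis (raises KeyError on unknown labels)."""
    return Counter(THEORETICAL_BASIS[label] for label in labels)


# Illustrative predictions, not real model output.
predicted = [
    "Antipathy: Economic Threat",
    "Antipathy: Security Threat",
    "Antipathy: Identity Threat",
    "Sympathy",
]
print(count_by_basis(predicted))
```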

Bot Detection Model

| Attribute | Value |
| --- | --- |
| Architecture | XGBoost (ONNX format) |
| Features | 17 user behavior and profile characteristics |
| Performance | F1 = 0.832 |
| Purpose | Enrich dataset with bot likelihood scores |

📊 Dataset

| Attribute | Value |
| --- | --- |
| Total Tweets | 5,884,624 (raw) → 3,814,679 (regular subset) |
| Time Period | January 1, 2011 - December 31, 2022 |
| Language | Turkish |
| Data Source | Twitter Academic API |
| Processing | Cleaned, deduplicated, enriched with bot/duplicate flags |

📋 Complete data schema (26 fields)

Important: Fields marked ❌ Confidential are retained only for internal compliance and are not distributed.

| Field | Type | Description | Availability |
| --- | --- | --- | --- |
| created_at | datetime | Tweet creation timestamp | ✅ Available |
| tweet_location | string | Geographic location (if available) | ✅ Available |
| text | string | Tweet content (Turkish) | ✅ Available |
| retweets | int | Number of retweets | ✅ Available |
| replies | int | Number of replies | ✅ Available |
| likes | int | Number of likes | ✅ Available |
| quote_count | int | Number of quote tweets | ✅ Available |
| author_id | string | Anonymized author identifier | ❌ Confidential |
| username | string | Author username | ❌ Confidential |
| name | string | Author display name | ❌ Confidential |
| author_pic | string | Profile picture URL | ❌ Confidential |
| author_followers | int | Follower count | ❌ Confidential |
| author_listed | int | Listed count | ❌ Confidential |
| author_following | int | Following count | ❌ Confidential |
| author_tweets | int | Total tweet count | ❌ Confidential |
| author_protected | boolean | Protected account status | ❌ Confidential |
| author_entities | json | Profile entities | ❌ Confidential |
| author_description | string | Profile bio | ❌ Confidential |
| author_verified | boolean | Verification status | ❌ Confidential |
| author_created_at | datetime | Account creation date | ❌ Confidential |
| author_withheld | string | Withheld status | ❌ Confidential |
| author_location | string | Author location | ❌ Confidential |
| is_duplicate | boolean | Exact duplicate flag | ✅ Available |
| bot_prob | float | Bot probability score (0-1) | ✅ Available |
| is_bot | boolean | Bot likelihood flag | ✅ Available |
| all_classes_results | json | AI model predictions | ✅ Available |
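
As an illustration of how the enrichment flags in the schema might be used, the sketch below keeps only records that are flagged neither as bots nor as exact duplicates. It assumes each delivered record can be loaded as a Python dict keyed by the field names above. It also assumes that excluding both flags is an appropriate filter; the README does not state exactly how the regular subset was derived. The records and helper names are hypothetical.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def is_regular(record: Record) -> bool:
    """True when the record is flagged neither as a bot nor as an exact duplicate."""
    return not record["is_bot"] and not record["is_duplicate"]


def regular_subset(records: Iterable[Record]) -> List[Record]:
    """Return the records that pass the is_regular check, preserving order."""
    return [r for r in records if is_regular(r)]


# Made-up records using only publicly available fields from the schema.
toy_records = [
    {"text": "örnek tweet 1", "is_duplicate": False, "bot_prob": 0.03, "is_bot": False},
    {"text": "örnek tweet 1", "is_duplicate": True, "bot_prob": 0.03, "is_bot": False},
    {"text": "örnek tweet 2", "is_duplicate": False, "bot_prob": 0.97, "is_bot": True},
]

kept = regular_subset(toy_records)
print(f"kept {len(kept)} of {len(toy_records)} records")
```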

🚀 Quick Start

Perception-Attitude Classification

# Install dependencies
pip install -r requirements.txt

# 8-class Granular model (recommended)
python run_inference.py --text "Mültecilere vatandaşlık verilmesin" --model-type all-classes

# 6-class Super-Class model (experimental)
python run_inference.py --text "Mültecilere vatandaşlık verilmesin" --model-type super-classes

# Run on CPU
python run_inference.py --text "Mültecilere vatandaşlık verilmesin" --device cpu
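
To classify several texts, the commands above can be wrapped in a small driver script. This is a minimal sketch that uses only the flags shown above (`--text`, `--model-type`, `--device`). The helper names `build_command` and `classify` are hypothetical, and because the output format of `run_inference.py` is not documented here, the sketch returns the raw stdout without parsing it.

```python
import subprocess
import sys
from typing import List, Optional


def build_command(text: str, model_type: str = "all-classes",
                  device: Optional[str] = None) -> List[str]:
    """Build the argument list for one run_inference.py call."""
    cmd = [sys.executable, "run_inference.py", "--text", text,
           "--model-type", model_type]
    if device is not None:
        cmd += ["--device", device]
    return cmd


def classify(text: str, model_type: str = "all-classes",
             device: Optional[str] = None) -> str:
    """Run the script from the repository root and return its raw stdout."""
    completed = subprocess.run(build_command(text, model_type, device),
                               capture_output=True, text=True, check=True)
    return completed.stdout


# Only the commands are printed here; call classify() inside a checkout
# of the repository to actually run the model.
texts = ["Mültecilere vatandaşlık verilmesin"]
for t in texts:
    print(build_command(t, device="cpu"))
```

Passing the arguments as a list avoids shell quoting problems with Turkish characters and spaces in the tweet text.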

Bot Detection

python run_bot_detection.py --features example_user_data.json

📈 Interactive Visualizations

Explore temporal dynamics of migration discourse: View Interactive Charts

  • 📊 Pan, zoom, and hover for detailed data points
  • 📅 12-year temporal coverage (2011-2022)
  • 💾 Export charts as PNG

📂 Repository Structure

Migration-TR/
├── trained_models/
│   ├── perception_attitude_clf_super_classes/   # 6-class model weights
│   ├── perception_attitude_all_classes/         # 8-class model weights
│   └── bot_clf/                                 # Bot detection model
├── docs/                                        # GitHub Pages site
│   ├── index.html                               # Main visualization page
│   └── assets/plots/                            # Interactive Plotly charts
├── run_inference.py                             # Classification inference script
├── run_bot_detection.py                         # Bot detection script
├── example_user_data.json                       # Sample bot detection input
├── requirements.txt                             # Python dependencies
└── DATA_USE_AGREEMENT.md                        # Data use agreement

📝 Data Access

🔐 Access Requirements

Who Can Access:

  • Academic Researchers at accredited institutions
  • Graduate Students with supervisor approval
  • Policy Researchers at recognized organizations
  • Non-commercial use only - no commercial applications

Not Permitted:

  • Commercial use or monetization
  • Surveillance or tracking applications
  • Attempts to re-identify users
  • Redistribution of raw tweet text

📋 Access Process

Step 1: Review Data Use Agreement

Read the Data Use Agreement (DATA_USE_AGREEMENT.md in this repository) carefully.

Step 2: Submit Request

Email your signed DUA to: info@csstr.org

Include:

  • Your institutional affiliation
  • Research purpose and methodology
  • Specific data requirements (which data chunks you need, from Chunk-1 to Chunk-11770)
  • Supervisor information (for students)

Step 3: Approval & Delivery

  • We review within 5 business days
  • Approved users receive secure download links
  • Data delivered as password-protected archives
  • Manual delivery: maximum 500 hydrated objects per recipient per day (non-automated delivery only)
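
The daily cap has practical consequences for planning a request. The sketch below does the arithmetic using the tweet counts from the dataset table. One observation, which is an inference and not something the authors state: the raw tweet count divided by 500 and rounded up gives 11,770, the same number as the chunk range in Step 2, which suggests chunks of up to 500 tweets.

```python
RAW_TWEETS = 5_884_624      # raw tweet count from the dataset table
REGULAR_TWEETS = 3_814_679  # regular (human-authored) subset
DAILY_CAP = 500             # hydrated objects per recipient per day


def days_needed(n_tweets: int, cap: int = DAILY_CAP) -> int:
    """Ceiling division: days required to receive n_tweets at `cap` per day."""
    return -(-n_tweets // cap)


print("raw dataset:", days_needed(RAW_TWEETS), "days")
print("regular subset:", days_needed(REGULAR_TWEETS), "days")
```

At this rate a full copy would take decades to deliver, so a request is best limited to the chunks the research actually needs.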

⚠️ Important Disclaimer

Data Delivery Policy: To comply with the X.com (formerly Twitter) Developer Policy, data is delivered manually under the following conditions:

  • Maximum 500 hydrated tweets per recipient per day (non-automated delivery via email/SFTP)
  • Multiple researchers can receive data simultaneously (500 objects per person per day)
  • Academic use only
  • No public redistribution of full tweet text allowed
  • 24-hour deletion compliance: CSSTR monitors the X Compliance API and will notify recipients, who must then delete or mask the affected tweets within 24 hours

Legal Framework: This dataset complies with:

  • X.com Developer Agreement (current version)
  • Turkish data protection laws
  • GDPR requirements for research
  • Academic research ethics standards

📖 Citation

If you use Migration-TR in your research, please cite:

@article{yilmazpolat_migration_2025,
  title={Migration Discourses on X.com: Analysis of Public Perceptions and 
         Attitudes Toward Refugees in Turkey Using Natural 
         Language Processing},
  author={Yılmaz Polat, Evrim and Çağın Polat, Evrim},
  journal={[Under Review]},
  year={2025},
  note={Dataset available at: https://github.com/cssturkiye/Migration-TR}
}

Paper Status: Currently under peer review. Citation will be updated upon acceptance.


🙏 Acknowledgments

Our classification model is built upon TurkishBERTweet by VRLLab (Najafi & Varol, 2024). We thank the authors for making their work available.


🤝 Contact

Evrim Yılmaz Polat, PhD - Corresponding Author
Department of Sociology, Zonguldak Bülent Ecevit University

Evrim Çağın Polat - Co-Author
Notrino Research, ODTÜ Teknokent, Ankara

📧 Email: info@csstr.org
🏛️ Organization: Computational Social Sciences Turkey (CSSTR)


Migration-TR | Advancing Migration Research Through Computational Social Science
