🇵🇰 GEOPAK: A Province Aware Hierarchical Geolocation Framework for Vision Based Localization in Pakistan
GEOPAK is a specialized geographic vision model designed to estimate the precise location (latitude, longitude) of images taken within Pakistan. Unlike global models, GEOPAK handles the specific visual diversity of Pakistan's provinces, from the coastline of Sindh to the mountainous terrain of Gilgit-Baltistan, by leveraging a novel dual-encoder architecture and a province-aware geocell classification system.
GEOPAK uses a Dual-Encoder Gated Fusion architecture to combine general object recognition with scene-specific features.
- Dual Input Encoders:
  - CLIP ViT-B/16: Semantic encoder (frozen) capturing high-level concepts (e.g., "mosque", "mountain").
  - ResNet50-Places365: Scene encoder capturing environmental context (e.g., "urban canyon", "glacier").
- Gated Fusion Mechanism:
  - Features from both encoders are projected to 512-dim and fused via a learnable gate that weighs the importance of semantic vs. scene features per image.
- Hierarchical Classification Heads:
  - Province Head: Predicts one of 7 provinces using the fused features.
  - Province-Gated Geocell Heads: A Mixture-of-Experts style block where the appropriate head (e.g., Sindh Head) is activated based on the province prediction.
- Embeddings: To condition the offset head, we use learnable embedding layers:
  - Province Embedding: ($7 \times 32$ dim)
  - Cell Embedding: ($N_{\text{cells}} \times 96$ dim)
  - These embeddings are concatenated with the visual features to inform the offset head about the "coarse" location it is refining.
- Critical Precision Head (Offset):
  - A regression head that predicts small $\Delta lat, \Delta lon$ adjustments. It inputs the fused visual vector + Cell Embedding + Province Embedding.
- Auxiliary Coarse Head:
  - A secondary regression head connected directly to the fusion layer. It is used only during training to force the encoders to retain global coordinate information (lat, lon) early in the network, stabilizing the training of the specific heads.
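The gated fusion step and the offset head's input size can be illustrated with a minimal, dependency-free sketch. The gate form used here (a sigmoid over the concatenated features, blending the two vectors element-wise) is an assumption, since the text only states that a learnable gate weighs semantic against scene features; the function and constant names are hypothetical and do not come from the GEOPAK code base.

```python
import math

# Dimensions quoted in the architecture description.
FUSED_DIM = 512
CELL_EMB_DIM = 96
PROVINCE_EMB_DIM = 32
# Offset head input = fused visual vector + cell embedding + province embedding.
OFFSET_INPUT_DIM = FUSED_DIM + CELL_EMB_DIM + PROVINCE_EMB_DIM  # 512 + 96 + 32 = 640


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fusion(semantic, scene, gate_weights, gate_bias):
    """Fuse two equal-length feature vectors with an element-wise gate.

    Assumed form (not confirmed by the source):
        gate  = sigmoid(W @ [semantic; scene] + b)
        fused = gate * semantic + (1 - gate) * scene
    """
    if len(semantic) != len(scene):
        raise ValueError("both projected feature vectors must have the same length")
    joint = list(semantic) + list(scene)
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, joint)) + b)
        for row, b in zip(gate_weights, gate_bias)
    ]
    fused = [g * s + (1.0 - g) * p for g, s, p in zip(gate, semantic, scene)]
    return fused, gate
```

With all gate weights at zero the gate sits at 0.5 and the output is the plain average of the two encoders; training would move each gate value toward whichever encoder is more informative for a given image.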
The dataset is constructed specifically for Pakistan using a targeted regional crawler. It is a curated collection of geographically diverse images drawn from multiple high-quality sources and targeted to capture the visual variance of Pakistan's landscape.
- Data Sources:
  - Google Places API: High-resolution images of verifiable landmarks, urban centers, and points of interest.
  - YF (Yahoo Flickr Creative Commons): A massive dataset of user-uploaded geotagged imagery providing diverse, in-the-wild perspectives.
  - Google Landmarks v2: A large-scale benchmark dataset for instance-level recognition and retrieval.
  - FlickApi: Integrated crawler specifically for fetching high-resolution, relevant regional imagery from Flickr's API.
- Processing Pipeline:
  - Crawl & Aggregate: Raw images and metadata are aggregated from the source APIs using regional bounding boxes.
  - Quality Filtration: Images are filtered to remove low-quality samples, non-geotagged entries, and indoor scenes irrelevant to geographic localization.
  - Geocell Construction (Clustering):
    - Algorithm: HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is used to cluster raw GPS coordinates. Unlike K-Means, it adapts to density variations (dense in Lahore, sparse in Thar).
    - Dynamic Balancing: We target a specific number of cells per province based on area and data density.
    - Radius Constraints: Clusters are constrained to a physically meaningful radius (e.g., max 50km for rural, 5km for urban) to ensure that knowing the "Cell ID" gives a strong location prior.
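The radius constraint can be checked after clustering with a great-circle distance. The sketch below is one possible way to do it, not the project's actual implementation: it assumes the cell center is the mean of the member coordinates (a reasonable approximation for small clusters away from the poles and the antimeridian) and uses the 5 km and 50 km limits quoted above. The helper names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0
# Example limits quoted in the text; the real per-province values may differ.
MAX_RADIUS_KM = {"urban": 5.0, "rural": 50.0}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def cell_center_and_radius(points):
    """Return the mean (lat, lon) of a cluster and its maximum radius in km."""
    n = len(points)
    center = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
    radius = max(haversine_km(center, p) for p in points)
    return center, radius


def within_radius_limit(points, kind):
    """True if the cluster fits inside the limit for 'urban' or 'rural' cells."""
    _, radius = cell_center_and_radius(points)
    return radius <= MAX_RADIUS_KM[kind]
```

A cluster that fails the check would need to be split or re-clustered with tighter parameters; how GEOPAK handles that case is not described in the source.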
The dataset contains a total of 90,515 images (81,462 Train, 9,053 Test). It reflects the natural imbalance of digital data availability in Pakistan: Sindh alone accounts for about 72% of all samples.
| Region | Samples | Density | Challenge |
|---|---|---|---|
| Sindh | 65,221 | Very High | High urban density (Karachi), strong coastal features. |
| Punjab | 8,459 | Medium-High | Dense urban usage, agricultural patterns. |
| KPK | 5,344 | Medium | Variegated terrain, moderate density. |
| Islamabad (ICT) | 4,410 | High (Local) | Very high density for small area. |
| Balochistan | 3,627 | Low | Extreme sparsity. Large area with very few geotagged photos. |
| Gilgit-Baltistan | 2,379 | Low-Med | Iconic tourism spots, but sparse non-tourist data. |
| Azad Kashmir | 1,075 | Very Low | Tourism-driven, similar coverage to GB. |
Availability: The full curated dataset of ~90k images will be made available soon via HuggingFace.
Checkpoints: https://huggingface.co/HaseebAsif/GEOPAK
⚠️ Limitation: The model may struggle in Balochistan and other sparsely covered areas because geotagged data for Pakistan is scarce.
To improve generalization in sparse data regions, we use a custom augmentation strategy that preserves critical geographic cues.
| Allowed (Safe) | Avoid (Biased) |
|---|---|
| Random Crop: Handles scale variation. | Horizontal Flips: Destroy road and traffic-orientation cues. |
| Color Jitter: Handles variable lighting & time-of-day. | Large Rotations: Destroy horizon/landscape cues. |
| Weather Simulation: Simulates fog, rain, and overcast. | Perspective Warps: Create unnatural distortions. |
| Slight Blur / Noise: Handles sensor diversity. | |
The GEOPAK model is trained on high-performance NVIDIA A100 GPUs powered by Modal, enabling scalable and efficient processing of Pakistan's geographic datasets.
Training is divided into three sequential phases to ensure stability and accuracy.
The initial stage establishes a strong baseline for province classification across Pakistan's diverse regions.
- Module: `model/province`
- Objective: Train the shared encoders specifically on province identification to capture macro-regional visual cues.
- Command: `python model/province/train_province.py --batch_size 64 --num-epochs 8`
Phase 1 focuses on learning the geographic layout and province-cell hierarchy while keeping the vision system stable.
- Encoders: Frozen
- Trainable Modules: Cell classifier, Offset heads, Cell embeddings.
- LR: $10^{-3}$
- Epochs: 15–25
- Command: `python model/phase1/train_phase1.py --batch_size 64 --num-epochs 25`
Phase 2 adapts the vision encoders specifically to Pakistan's regional features.
- Encoders: Unfreeze top 25–30% of layers.
- LR: Encoder ($10^{-5}$), Heads ($5 \times 10^{-4}$)
- Epochs: 20–30
- Command: `python model/phase2/train_phase2.py --load_from_phase1 checkpoints/phase1/best.pt`
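The partial unfreezing and the two learning rates used in the last phase can be pictured as a split of the encoder layers plus two optimizer parameter groups. The sketch below works on layer names rather than real tensors so it runs without any deep learning library; the function name, the layer names, and the rounding rule are assumptions made for illustration, not the contents of `train_phase2.py`.

```python
def build_phase2_param_groups(
    encoder_layers,
    head_modules,
    unfreeze_fraction=0.3,   # text: unfreeze the top 25-30% of layers
    encoder_lr=1e-5,         # text: encoder learning rate
    head_lr=5e-4,            # text: head learning rate
):
    """Split encoder layers into frozen/unfrozen and build per-group learning rates.

    encoder_layers must be ordered from the input side (bottom) to the
    output side (top); only the top slice is unfrozen.
    """
    if not 0.0 <= unfreeze_fraction <= 1.0:
        raise ValueError("unfreeze_fraction must be between 0 and 1")
    n_unfrozen = round(len(encoder_layers) * unfreeze_fraction)
    split = len(encoder_layers) - n_unfrozen
    frozen = list(encoder_layers[:split])
    unfrozen = list(encoder_layers[split:])
    groups = [
        {"name": "encoder_top", "params": unfrozen, "lr": encoder_lr},
        {"name": "heads", "params": list(head_modules), "lr": head_lr},
    ]
    return frozen, groups
```

For a 12-block encoder, a fraction of 0.25 leaves the bottom 9 blocks frozen and trains the top 3 at the small encoder rate, while the heads keep the larger rate. In a real training script each `params` entry would hold the layers' parameter tensors instead of their names.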
The model optimizes a multi-component objective function designed to handle both classification accuracy and metric precision.
A Weighted Cross-Entropy Loss is used to predict the correct province.
For the correct province, the corresponding geocell head is trained using Kullback-Leibler (KL) Divergence with Distance-Aware Label Smoothing. Instead of a hard one-hot target, the target distribution spreads probability mass over geocells according to their distance from the true location, so predicting a neighboring cell is penalized less than predicting a distant one.
The offset head predicts the deviations $\Delta lat, \Delta lon$ of the true coordinate from the center of its geocell.
The final objective is a weighted sum of these components, together with an auxiliary loss.
The auxiliary loss is a direct Haversine regression loss applied to the Auxiliary Head. This head tries to predict the global (lat, lon) directly from the fusion embeddings without using cells. This acts as a regularizer, ensuring the shared embeddings contain strong global positioning information.
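As an illustration of distance-aware label smoothing, the sketch below builds a soft target over geocells and scores a prediction with KL divergence. The exponential form $y_j \propto \exp(-(d_j - d_{\text{true}})/\tau)$, the temperature `tau_km`, and all function names are assumptions chosen for the example; the exact formulation and loss weights used by GEOPAK are not given in the text.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def distance_smoothed_targets(true_latlon, cell_centers, true_cell, tau_km=75.0):
    """Soft target over cells: nearer cells receive more probability mass.

    tau_km is a hypothetical temperature; larger values give flatter targets.
    """
    d_true = haversine_km(true_latlon, cell_centers[true_cell])
    raw = [
        math.exp(-(haversine_km(true_latlon, center) - d_true) / tau_km)
        for center in cell_centers
    ]
    total = sum(raw)
    return [r / total for r in raw]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions of equal length."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )
```

With this kind of target, a prediction that puts its mass on a cell a few kilometers from the truth incurs a much smaller loss than one that picks a cell in another part of the country, which is the behavior the smoothing is meant to encourage.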
GEOPAK does not use a simple argmax approach. Instead, it employs a probabilistic mixture to handle spatial ambiguity across borders:
- Province Selection: The model predicts province probabilities and selects the Top-2 provinces.
- Cell Selection: For each selected province, the model selects the Top-K (K=5) geocells.
- Coordinate Refinement: For each of the 10 candidate cells, the model calculates the final coordinate as: $$ \text{pred}_{i} = \text{cell center}_{i} + \text{offset}_{i} $$
- Weighted Aggregation: The final output is the weighted sum of these hypotheses based on their joint probability: $$ P(i) = P(\text{province} \mid \text{image}) \times P(\text{cell} \mid \text{image, province}) $$ $$ \text{Final LatLon} = \sum_{i} P(i) \times \text{pred}_{i} $$
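The four inference steps can be sketched in plain Python as follows. The data layout (dictionaries keyed by province name) and the function name are hypothetical. The sketch also renormalizes the joint probabilities over the selected candidates so that the weights sum to 1; that step is an assumption, because the probabilities of only the top candidates do not generally sum to 1 and the text does not say how this is handled.

```python
def aggregate_prediction(
    province_probs,   # {province: P(province | image)}
    cell_probs,       # {province: [P(cell | image, province), ...]}
    cell_centers,     # {province: [(lat, lon), ...]}
    offsets,          # {province: [(dlat, dlon), ...]} predicted per cell
    top_provinces=2,
    top_cells=5,
):
    """Blend the top province/cell hypotheses into one (lat, lon) estimate."""
    best_provinces = sorted(province_probs, key=province_probs.get, reverse=True)
    hypotheses = []
    for province in best_provinces[:top_provinces]:
        probs = cell_probs[province]
        best_cells = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
        for cell in best_cells[:top_cells]:
            # pred_i = cell center_i + offset_i
            lat = cell_centers[province][cell][0] + offsets[province][cell][0]
            lon = cell_centers[province][cell][1] + offsets[province][cell][1]
            # P(i) = P(province | image) * P(cell | image, province)
            weight = province_probs[province] * probs[cell]
            hypotheses.append((weight, lat, lon))
    total = sum(w for w, _, _ in hypotheses)
    if total <= 0.0:
        raise ValueError("all candidate hypotheses have zero probability")
    # Assumed renormalization over the selected candidates.
    final_lat = sum(w * lat for w, lat, _ in hypotheses) / total
    final_lon = sum(w * lon for w, _, lon in hypotheses) / total
    return final_lat, final_lon
```

Averaging coordinates this way works well when the candidates are close together; when the top provinces are far apart, the blended point can land between them, which is a known trade-off of mixture-style decoding.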
To predict the location of a new image:
```python
from model.phase2.inference_phase2 import GeopakPredictor

# Load the trained Phase 2 model
predictor = GeopakPredictor(checkpoint_path="checkpoints/phase2/best.pt")

# Predict the location of a single image
img_path = "assets/test_image.jpg"
result = predictor.predict(img_path)

print(f"Province: {result['province']}")
print(f"Location: {result['lat']}, {result['lon']}")
print(f"Confidence: {result['confidence']:.2f}")
```

The primary limitation of this framework is the extreme sparsity of geotagged data available in Pakistan. Despite exhaustive efforts to extract imagery from all possible sources, including the Google Places API, Flickr API, and custom regional crawlers, the resulting dataset remains significantly smaller than global benchmarks.
Consequently, the model is often unable to fully learn complex geographic patterns across the varied Pakistani terrain from such a limited sample size. This results in a performance bias toward high-density urban centers like Lahore and Karachi, while the model struggles to generalize in remote, data-scarce regions. While the dual-encoder gated fusion architecture is technically robust, its potential is currently constrained by these data-related gaps. Future work will focus on overcoming these sparsity issues through multi-modal metadata integration and further regional data density expansion.
