A production-ready collection of 14 professionally documented NLP scripts covering fundamental to advanced techniques
Features • Installation • Scripts • Usage • Contributing
- Overview
- Key Features
- Prerequisites
- Installation
- Script Catalog
- Usage Examples
- Technical Architecture
- Project Structure
- Learning Path
- Contributing
- License
- Contact
This repository contains a comprehensive, production-ready collection of Natural Language Processing (NLP) scripts designed for both learning and practical application. Each script is meticulously documented with:
- ✅ Executive summaries explaining the "what" and "why"
- ✅ Step-by-step code walkthroughs with inline comments
- ✅ Real-world applications and use cases
- ✅ Technical deep-dives into algorithms and mathematics
- ✅ Best practices for NLP pipelines
Perfect for:
- 🎓 Students learning NLP fundamentals
- 💼 Data Scientists preparing for interviews (FAANG-level)
- 🔬 Researchers exploring NLP techniques
- 👨💻 Developers building text processing applications
- Text Preprocessing: Regex, tokenization, stemming, lemmatization
- Feature Engineering: Bag of Words (BoW), TF-IDF, N-Grams
- Machine Learning: Text classification, spam detection, sentiment analysis
- Deep Learning: Word embeddings (Word2Vec), neural networks with TensorFlow
- Advanced NLP: Text summarization, Named Entity Recognition (NER), POS tagging
- Every script includes detailed comments explaining logic and rationale
- Mathematical formulas and algorithmic explanations
- Comparison of different approaches (e.g., stemming vs. lemmatization)
- Performance considerations and optimization tips
- Clean, modular, and reusable code structure
- Error handling and edge case management
- Efficient implementations using industry-standard libraries
- Ready for integration into larger projects
- Basic Python programming (functions, loops, data structures)
- Understanding of machine learning concepts (optional but helpful)
- Familiarity with command line/terminal
- Python: 3.7 or higher
- RAM: Minimum 4GB (8GB recommended for deep learning scripts)
- Storage: ~500MB for libraries and datasets
Clone the repository:

```bash
git clone https://github.com/imdataScientistSachin/NLP.git
cd NLP
```

Create and activate a virtual environment:

```bash
# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python3 -m venv venv
source venv/bin/activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Download the required NLTK data:

```python
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
```

Pinned dependencies in requirements.txt:

```text
nltk==3.8.1
numpy==1.24.3
pandas==2.0.3
scikit-learn==1.3.0
tensorflow==2.13.0
beautifulsoup4==4.12.2
gensim==4.3.1
imbalanced-learn==0.11.0
matplotlib==3.7.2
```

Purpose: Master Python's re module for pattern matching and text manipulation
Key Concepts:
- Substitution with `re.sub()`
- Pattern matching with `re.search()`
- Metacharacters: `\d`, `\D`, `\w`, `\W`, `\s`, `\S`
- Anchors: `^` (start), `$` (end)
- Character classes: `[a-z]`, `[^rw]`
Use Cases: Data cleaning, text anonymization, input validation
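A minimal sketch of these concepts in action (the sample string and patterns are illustrative, not taken from the script):

```python
import re

text = "Order #4521 shipped to john.doe@example.com on 2024-05-01."

# Substitution: collapse every digit run to '#' (\d matches a digit)
masked = re.sub(r"\d+", "#", text)

# Pattern matching: find the first email-like token (\S matches non-whitespace)
match = re.search(r"\S+@\S+", text)
print(match.group() if match else "no email found")

# Anchors: ^ matches the start of the string, $ the end
print(bool(re.search(r"^Order", text)))   # True
print(bool(re.search(r"\.$", text)))      # True

# Character classes: [a-z] matches lowercase letters only
print(re.findall(r"[a-z]+", "NLP is Fun"))   # ['is', 'un']
```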
Purpose: Deep dive into complex regex patterns for real-world text processing
Highlights:
- Email and URL extraction
- Phone number validation
- HTML tag removal
- Advanced pattern matching
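A rough sketch of these tasks with deliberately simplified patterns (the HTML snippet, email, URL, and phone formats are illustrative; production-grade validation needs stricter expressions):

```python
import re

html = '<p>Contact <a href="https://example.com">us</a> at support@example.com or +1 415-555-0199.</p>'

# Remove HTML tags (fine for flat markup; not a full HTML parser)
plain = re.sub(r"<[^>]+>", "", html)

# Extract email addresses and URLs
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", plain)
urls = re.findall(r'https?://[^\s"<]+', html)

# Validate a US-style phone number
phone_ok = bool(re.fullmatch(r"\+?1?[ -]?\d{3}[ -]?\d{3}[ -]?\d{4}", "+1 415-555-0199"))

print(plain)
print(emails, urls, phone_ok)
```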
Purpose: Comprehensive introduction to Natural Language Toolkit (NLTK)
Techniques Covered:
- Tokenization: Sentence and word-level splitting
- Stemming: Reducing words to root form (PorterStemmer)
- Lemmatization: Dictionary-based word normalization
- Stopword Removal: Filtering common words
- POS Tagging: Part-of-speech identification
- Named Entity Recognition (NER): Extracting entities (people, places, organizations)
Real-World Application: Building preprocessing pipelines for text classification
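A short sketch of the POS tagging and NER steps with NLTK (the example sentence is illustrative; it relies on the NLTK data downloaded during installation):

```python
import nltk

sentence = "Barack Obama visited Google headquarters in California."

# Tokenize and tag parts of speech
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)   # e.g. [('Barack', 'NNP'), ('visited', 'VBD'), ...]

# Chunk named entities from the POS-tagged tokens
tree = nltk.ne_chunk(tagged)
for subtree in tree:
    if hasattr(subtree, "label"):   # entity chunks are nltk.Tree nodes
        entity = " ".join(token for token, _ in subtree.leaves())
        print(subtree.label(), "->", entity)
```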
Purpose: Convert text into numerical features using word frequency
Algorithm:
- Tokenize text into words
- Build vocabulary of unique words
- Create feature vectors based on word presence/frequency
- Generate sparse matrix representation
Mathematical Foundation:
Vector(document) = [count(word1), count(word2), ..., count(wordN)]
Limitations: Ignores word order and semantics
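A minimal from-scratch sketch of the four steps above (toy documents, dense vectors for readability; real implementations use sparse matrices):

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat on the log"]

# 1-2. Tokenize and build a vocabulary of unique words
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})

# 3-4. Create count vectors over the vocabulary
vectors = []
for doc in tokenized:
    counts = Counter(doc)
    vectors.append([counts.get(word, 0) for word in vocab])

print(vocab)     # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vectors)   # [[1, 0, 0, 1, 1, 1, 2], [0, 1, 1, 0, 1, 1, 2]]
```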
Purpose: Advanced text vectorization weighing word importance
Algorithm:
- TF (Term Frequency): `TF(t,d) = count(t in d) / total_words(d)`
- IDF (Inverse Document Frequency): `IDF(t) = log(N / df(t))`
- TF-IDF: `TF-IDF(t,d) = TF(t,d) × IDF(t)`
Advantages over BoW:
- Down-weights common words (e.g., "the", "is")
- Up-weights rare, distinctive terms
- Better for document similarity and classification
Implementation: From-scratch implementation + scikit-learn comparison
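A compact from-scratch sketch of the formulas above (toy documents; note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF, so its numbers differ slightly):

```python
import math

docs = [
    ["machine", "learning", "is", "amazing"],
    ["deep", "learning", "is", "fun"],
    ["machine", "translation", "is", "hard"],
]
N = len(docs)

def tf(term, doc):
    # TF(t,d) = count(t in d) / total_words(d)
    return doc.count(term) / len(doc)

def idf(term):
    # IDF(t) = log(N / df(t)), df(t) = number of documents containing t
    df = sum(1 for doc in docs if term in doc)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("machine", docs[0]))   # rarer term gets a positive weight
print(tf_idf("is", docs[0]))        # appears in every doc -> IDF = log(1) = 0
```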
Purpose: Statistical language modeling and text generation
Concept:
- Unigram (N=1): Single words
- Bigram (N=2): Two-word sequences
- Trigram (N=3): Three-word sequences
Algorithm:
- Build dictionary of N-word sequences → next word
- Use Markov assumption (next word depends only on previous N-1 words)
- Generate text by probabilistically selecting next words
Applications: Autocomplete, text generation, speech recognition
Limitations: Data sparsity, no long-range dependencies
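A tiny bigram sketch of the algorithm above (toy corpus and seed are illustrative):

```python
import random
from collections import defaultdict

corpus = "the cat sat on the mat and the cat ran".split()
n = 2   # bigram model: next word depends only on the previous word

# Build dictionary: (previous n-1 words) -> list of observed next words
model = defaultdict(list)
for i in range(len(corpus) - (n - 1)):
    context = tuple(corpus[i:i + n - 1])
    model[context].append(corpus[i + n - 1])

# Generate text by sampling a next word for the current context (Markov assumption)
random.seed(0)
context = ("the",)
output = list(context)
for _ in range(8):
    choices = model.get(context)
    if not choices:
        break
    word = random.choice(choices)   # repeated entries make frequent words more likely
    output.append(word)
    context = (word,)
print(" ".join(output))
```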
Purpose: End-to-end text classification with model persistence
Pipeline:
- Text preprocessing (cleaning, tokenization)
- Feature extraction (TF-IDF)
- Model training (Naive Bayes, SVM, Random Forest)
- Model serialization with `pickle`
- Prediction on new data
Key Techniques:
- Train-test split
- Cross-validation
- Hyperparameter tuning
- Model evaluation (accuracy, precision, recall, F1-score)
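A condensed sketch of this pipeline, including model persistence (the toy texts and labels are placeholders, not the script's dataset):

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy labelled data for illustration: 1 = spam, 0 = ham
texts = ["win a free prize now", "meeting at 10am tomorrow",
         "free cash offer", "project update attached"]
labels = [1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels)

# Feature extraction + model training
vectorizer = TfidfVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(X_train), y_train)

# Serialize the vectorizer and model together so new text is transformed identically later
with open("text_clf.pkl", "wb") as f:
    pickle.dump({"vectorizer": vectorizer, "model": clf}, f)

# Reload and predict on new data
with open("text_clf.pkl", "rb") as f:
    saved = pickle.load(f)
print(saved["model"].predict(saved["vectorizer"].transform(["claim your free prize"])))
```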
Purpose: Building reusable ML pipelines for text processing
Benefits:
- Encapsulates preprocessing + model training
- Prevents data leakage
- Simplifies cross-validation
- Easy deployment
Example Pipeline:
```python
Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])
```

Purpose: Automatic summarization using frequency-based ranking
Algorithm:
- Web Scraping: Fetch article using BeautifulSoup
- Preprocessing: Remove citations, clean text
- Word Frequency: Calculate normalized word counts
- Sentence Scoring: Rank sentences by sum of word frequencies
- Selection: Extract top N sentences
Approach: Extractive (selects existing sentences, no generation)
Use Cases: News summarization, document digests, content curation
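A condensed sketch of the scoring idea (the script itself also scrapes the article with BeautifulSoup; the paragraph here is a stand-in, and stopword removal is omitted for brevity):

```python
import heapq
import nltk

text = ("Natural language processing lets computers read text. "
        "Summarization selects the most informative sentences. "
        "Frequency-based summarizers score sentences by the words they contain. "
        "The highest-scoring sentences form the summary.")

# Word frequencies, normalized by the most frequent word
words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
freq = {w: words.count(w) for w in set(words)}
max_freq = max(freq.values())
freq = {w: c / max_freq for w, c in freq.items()}

# Score each sentence as the sum of its word frequencies
scores = {}
for sent in nltk.sent_tokenize(text):
    scores[sent] = sum(freq.get(w.lower(), 0) for w in nltk.word_tokenize(sent))

# Select the top N sentences as the summary
summary = heapq.nlargest(2, scores, key=scores.get)
print(" ".join(summary))
```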
Purpose: SMS spam detection with class imbalance solutions
Challenge: Imbalanced datasets (95% ham, 5% spam) → biased models
Solutions Implemented:
- ADASYN: Adaptive Synthetic Sampling (focuses on hard-to-learn examples)
- SMOTE: Synthetic Minority Over-sampling Technique
Pipeline:
Text → Custom Preprocessing → BoW → TF-IDF → ADASYN/SMOTE → Random Forest
Evaluation Metrics:
- Confusion Matrix
- Precision, Recall, F1-Score
- ROC-AUC
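A minimal sketch of the oversampling step with imbalanced-learn (the toy messages below are placeholders; in the real pipeline, resample only the training split, never the test set):

```python
from imblearn.over_sampling import SMOTE            # ADASYN lives in the same module
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny imbalanced toy set: mostly ham (0), few spam (1)
texts = [
    "are we still on for lunch", "call me when you land", "see you at the gym",
    "meeting moved to friday", "thanks for the update", "happy birthday mate",
    "WIN a FREE prize, reply now", "URGENT: claim your cash reward",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

X = TfidfVectorizer().fit_transform(texts)

# Oversample the minority class; k_neighbors is lowered only because this toy minority class is tiny
X_res, y_res = SMOTE(random_state=42, k_neighbors=1).fit_resample(X, labels)
print(f"spam before: {labels.count(1)}/{len(labels)}, after: {list(y_res).count(1)}/{len(y_res)}")

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_res, y_res)
```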
Purpose: Classify stock market news sentiment
Domain: Financial NLP
Techniques: Sentiment analysis, domain-specific preprocessing
Applications: Algorithmic trading, market sentiment analysis
Purpose: Prepare text data for neural networks using TensorFlow/Keras
Key Techniques:
- Tokenization: Convert text to integer sequences
- Padding: Ensure uniform sequence length
- Vocabulary Management: Handle out-of-vocabulary (OOV) words
- Embedding Preparation: Set up for embedding layers
Configuration:
- `max_length`: Maximum sequence length
- `vocab_size`: Vocabulary size
- `oov_token`: Token for unknown words
- `padding_type`: 'pre' or 'post'
- `truncating_type`: 'pre' or 'post'
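A minimal sketch of the tokenize-and-pad workflow using these options (the sentences and hyperparameter values are illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ["I love my dog", "I love my cat", "Do you think my dog is amazing?"]
vocab_size, max_length = 100, 8

# Fit the tokenizer; words outside the learned vocabulary map to the OOV token
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)

# Convert text to integer sequences and pad/truncate to a uniform length
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, maxlen=max_length, padding="post", truncating="post")

print(tokenizer.word_index)
print(padded)   # shape (3, max_length), zero-padded at the end
```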
Purpose: Learn dense vector representations of words
Concept: Words with similar meanings have similar vector representations
Algorithm: Word2Vec (Skip-Gram model)
- Predicts context words given a target word
- Learns embeddings through neural network training
Mathematical Magic:
king - man + woman ≈ queen
Implementation:
- Web scraping (Wikipedia article)
- Text preprocessing
- Sentence tokenization
- Stopword removal
- Word2Vec training with Gensim
- Similarity queries
Parameters:
- `vector_size`: Embedding dimensionality (e.g., 100)
- `window`: Context window size
- `min_count`: Minimum word frequency
- `workers`: Parallel processing threads
Applications: Semantic search, document similarity, transfer learning
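A minimal Gensim sketch with these parameters (pre-tokenized toy sentences stand in for the scraped Wikipedia text, so the resulting vectors are only illustrative):

```python
from gensim.models import Word2Vec

# Pre-tokenized sentences (the script builds these from a scraped article)
sentences = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["man", "walks", "in", "the", "city"],
    ["woman", "walks", "in", "the", "city"],
]

# Skip-gram Word2Vec (sg=1)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4, sg=1)

# Similarity queries on the learned embeddings
print(model.wv.most_similar("king", topn=3))
print(model.wv.similarity("king", "queen"))
```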
Purpose: Binary classification using neural networks
Architecture:
Input → Embedding → GlobalAveragePooling1D → Dense(24, ReLU) → Dense(1, Sigmoid)
Dataset: JSON format with sarcastic/non-sarcastic headlines
Training:
- Loss: Binary Crossentropy
- Optimizer: Adam
- Metrics: Accuracy
- Epochs: 30
Visualization: Training/validation accuracy and loss curves
Key Insight: GlobalAveragePooling1D averages embeddings across sequence length, creating fixed-size representation
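A sketch of this architecture in Keras (the vocabulary size, embedding dimension, and sequence length are illustrative hyperparameters, not necessarily the script's values):

```python
import tensorflow as tf

vocab_size, embedding_dim, max_length = 10000, 16, 100

# Embedding -> GlobalAveragePooling1D -> Dense(24, ReLU) -> Dense(1, Sigmoid)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()

# Training on padded sequences and 0/1 labels prepared as in script 12:
# history = model.fit(train_padded, train_labels, epochs=30,
#                     validation_data=(val_padded, val_labels))
```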
Text preprocessing with NLTK:

```python
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Your text
text = "Hello! This is a sample text for NLP preprocessing."

# Tokenize
sentences = nltk.sent_tokenize(text)
words = nltk.word_tokenize(text)

# Lemmatize
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word) for word in words]

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered = [w for w in lemmatized if w.lower() not in stop_words]
print(filtered)
```

TF-IDF vectorization with scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Documents
docs = [
    "Machine learning is amazing",
    "Deep learning is a subset of machine learning",
    "Natural language processing uses machine learning"
]

# Create TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# View feature names and the dense matrix
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())
```

Text classification pipeline:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Build pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())
])

# Train (X_train/y_train: your labelled texts and labels)
pipeline.fit(X_train, y_train)

# Predict
predictions = pipeline.predict(X_test)
```

Processing flow:

```text
Raw Text
    ↓
[Preprocessing]
    ├── Regex Cleaning (01, 02)
    ├── Tokenization (03)
    ├── Normalization (03)
    └── Stopword Removal (03)
    ↓
[Feature Engineering]
    ├── BoW (04)
    ├── TF-IDF (05)
    ├── N-Grams (06)
    └── Word Embeddings (13)
    ↓
[Machine Learning]
    ├── Classical ML (07, 08, 10, 11)
    └── Deep Learning (12, 14)
    ↓
[Applications]
    ├── Classification
    ├── Summarization (09)
    └── Text Generation (06)
```
| Layer | Technologies |
|---|---|
| Core Language | Python 3.7+ |
| NLP Libraries | NLTK, Gensim, spaCy |
| ML Frameworks | Scikit-Learn, Imbalanced-Learn |
| Deep Learning | TensorFlow 2.x, Keras |
| Data Processing | NumPy, Pandas |
| Web Scraping | BeautifulSoup4, urllib |
| Visualization | Matplotlib |
```text
NLP/
│
├── 01_reModule.py                    # Regular expressions fundamentals
├── 02_regex.py                       # Advanced regex patterns
├── 03_April_NLTK.py                  # NLTK comprehensive tutorial
├── 04_BOW.py                         # Bag of Words implementation
├── 05_TF_IDF.py                      # TF-IDF from scratch
├── 06_NGram.py                       # N-Gram language models
├── 07_textClassification_pickle.py   # Text classification + model persistence
├── 08_Pipeline_LoadFiles.py          # Scikit-Learn pipelines
├── 09_summarization.py               # Extractive text summarization
├── 10_smsClassification.py           # SMS spam detection (imbalanced data)
├── 11_Stock_Classification.py        # Financial sentiment analysis
├── 12_preProcessing_NLP_DL_TF.py     # Deep learning preprocessing
├── 13_gensim_word_Embedding.py       # Word2Vec embeddings
├── 14_Sarcastic_json_IMP.py          # Sarcasm detection (deep learning)
├── README.md                         # This file
└── requirements.txt                  # Python dependencies
```
- Start with `01_reModule.py` - Learn regex basics
- Move to `03_April_NLTK.py` - Master NLTK fundamentals
- Understand `04_BOW.py` - Grasp text vectorization
- Study `05_TF_IDF.py` - Advanced feature engineering
- Explore `06_NGram.py` - Language modeling
- Practice `07_textClassification_pickle.py` - Build classifiers
- Master `08_Pipeline_LoadFiles.py` - Production pipelines
- Implement `09_summarization.py` - Real-world application
- Tackle `10_smsClassification.py` - Handle imbalanced data
- Deep dive into `13_gensim_word_Embedding.py` - Word embeddings
- Build `14_Sarcastic_json_IMP.py` - Neural networks
- Integrate `12_preProcessing_NLP_DL_TF.py` - DL preprocessing
Contributions are welcome! Here's how you can help:
- 🐛 Report bugs or issues
- 💡 Suggest new features or scripts
- 📝 Improve documentation
- 🔧 Submit pull requests
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- Follow PEP 8 style guide
- Add comprehensive comments
- Include docstrings for functions
- Write unit tests for new features
This project is licensed under the MIT License - see the LICENSE file for details.
- ✅ Commercial use
- ✅ Modification
- ✅ Distribution
- ✅ Private use
- ❌ Liability
- ❌ Warranty
Sachin Paunikar
- 📧 Email: ImdataScientistSachin@gmail.com
- 💼 LinkedIn: www.linkedin.com/in/sachin-paunikar-datascientists
- 🐙 GitHub: https://github.com/ImdataScientistSachin/-NLP-Tutorial-Collection
- NLTK Team - For the comprehensive NLP toolkit
- Scikit-Learn Contributors - For robust ML algorithms
- TensorFlow Team - For deep learning framework
- Gensim Developers - For word embedding implementations
- Stack Overflow Community - For invaluable debugging help