Data Preprocessing Techniques

1 Data Cleaning

Clean text data by removing noise and unnecessary elements

import neattext.functions as nfx

def get_clean_text(df):
    # Strip structured artifacts first (handles, URLs, HTML, etc.) while their
    # patterns are still intact; removing punctuation earlier would break
    # URL and handle detection
    df['text'] = df['text'].apply(nfx.remove_userhandles)
    df['text'] = df['text'].apply(nfx.remove_urls)
    df['text'] = df['text'].apply(nfx.remove_html_tags)
    df['text'] = df['text'].apply(nfx.remove_hashtags)
    df['text'] = df['text'].apply(nfx.remove_phone_numbers)
    df['text'] = df['text'].apply(nfx.remove_emojis)
    df['text'] = df['text'].apply(nfx.remove_punctuations)
    df['text'] = df['text'].apply(nfx.remove_stopwords)
    return df

2 Tokenization

Convert text into tokens that models can process

from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained(
    "path/to/tokenizer", 
    do_lower_case=True
)

def tokenize(comment):
    encoded_dict = tokenizer.encode_plus(
        str(comment),
        add_special_tokens=True,
        max_length=512,
        padding='max_length',  # pad_to_max_length is deprecated
        return_attention_mask=True,
        return_tensors='pt',
        truncation=True
    )
    
    return {
        "input_ids": encoded_dict['input_ids'],
        "mask": encoded_dict['attention_mask'],
        "token_type_ids": encoded_dict['token_type_ids']
    }

Model Initialization

1 Text Classification Model

Define a custom BERT-based classification model

from transformers import AutoModel
import torch.nn as nn

class TextClassification(nn.Module):
    def __init__(self, n_classes, dropout, model_ckpt):
        super(TextClassification, self).__init__()
        self.bert = AutoModel.from_pretrained(model_ckpt)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(768, n_classes)
    
    def forward(self, ids, mask, token_type_ids):
        outputs = self.bert(
            ids,
            attention_mask=mask,
            token_type_ids=token_type_ids
        )
        # Classify from the pooled [CLS] representation
        dropped = self.dropout(outputs.pooler_output)
        return self.out(dropped)

# Initialize model
model = TextClassification(
    n_classes=5,
    dropout=0.3,  # a rate like 0.9 would drop almost all activations
    model_ckpt="bert-base-uncased"
)

2 Prediction Function

Generate predictions from the trained model

def classify(ids, mask, type_ids, device):
    # Move tensors to device
    ids = ids.to(device, dtype=torch.long)
    mask = mask.to(device, dtype=torch.long)
    token_type_ids = type_ids.to(device, dtype=torch.long)
    
    # Get model predictions without tracking gradients
    model.eval()
    with torch.no_grad():
        outputs = model(
            ids=ids,
            mask=mask,
            token_type_ids=token_type_ids
        )
    
    # Apply sigmoid (per-label probabilities) and convert to a list
    result = torch.sigmoid(outputs)
    scores = result.cpu().numpy().tolist()
    
    return scores

Machine Learning Fundamentals

1 Supervised Learning

Learn from labeled training data to make predictions

Common Algorithms:
  • Linear Regression: Predicting continuous values
  • Logistic Regression: Binary classification problems
  • Decision Trees: Tree-based decision making
  • Random Forests: Ensemble of decision trees
  • Support Vector Machines: Finding optimal hyperplanes
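As a concrete sketch of supervised learning, here is a minimal logistic-regression example on synthetic data; the dataset sizes and split ratio are illustrative, not prescriptive:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem with 5 features
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit on labeled training data, then evaluate on held-out data
clf = LogisticRegression().fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")
```

The same fit/score pattern applies to the other scikit-learn estimators listed above (decision trees, random forests, SVMs).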

2 Unsupervised Learning

Discover patterns in unlabeled data

Common Techniques:
  • K-Means Clustering: Grouping similar data points
  • Hierarchical Clustering: Creating cluster dendrograms
  • PCA: Dimensionality reduction technique
  • Autoencoders: Neural network-based feature learning
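To make clustering concrete, here is a small K-Means sketch on two synthetic, well-separated point clouds (the blob locations and sizes are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated 2-D blobs of 50 points each
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(0.0, 0.5, (50, 2)),
    rng.normal(5.0, 0.5, (50, 2)),
])

# Group the unlabeled points into 2 clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)
```

With blobs this far apart, each blob receives its own cluster label even though no labels were provided, which is the essence of unsupervised learning.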

3 Model Evaluation

Assess model performance using various metrics

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true: ground-truth labels, y_pred: model predictions
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='weighted')
recall = recall_score(y_true, y_pred, average='weighted')
f1 = f1_score(y_true, y_pred, average='weighted')

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

Deep Learning

1 Neural Network Basics

Understanding the building blocks of deep learning

import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Create model instance
model = SimpleNN(input_size=784, hidden_size=128, output_size=10)
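A network like the one above is trained with a loss function and an optimizer. The following sketch fits an equivalent two-layer network to a single random mini-batch; the batch size, learning rate, and step count are illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Same architecture as SimpleNN above, expressed with nn.Sequential
net = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

inputs = torch.randn(32, 784)          # random mini-batch of 32 inputs
targets = torch.randint(0, 10, (32,))  # random class labels

losses = []
for _ in range(20):
    optimizer.zero_grad()              # clear gradients from the last step
    loss = criterion(net(inputs), targets)
    loss.backward()                    # backpropagate
    optimizer.step()                   # update weights
    losses.append(loss.item())
```

Even on random labels, the loss on this one batch drops over the 20 steps, which confirms the forward/backward/update cycle is wired correctly.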

2 Convolutional Neural Networks (CNN)

Specialized networks for image processing

Key Components:
  • Convolutional Layers: Extract spatial features
  • Pooling Layers: Reduce spatial dimensions
  • Fully Connected Layers: Final classification
  • Activation Functions: ReLU, LeakyReLU, etc.
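The components above can be combined into a minimal CNN for 28x28 grayscale images (the channel counts and input size are illustrative choices, not requirements):

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # extract spatial features
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(32 * 7 * 7, n_classes)  # fully connected head

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

# Batch of 4 grayscale 28x28 images -> logits of shape (4, 10)
out = SimpleCNN()(torch.randn(4, 1, 28, 28))
```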

3 Recurrent Neural Networks (RNN)

Networks for sequential data processing

Variants:
  • LSTM: Long Short-Term Memory networks
  • GRU: Gated Recurrent Units
  • Bidirectional RNN: Process sequences in both directions
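As a small illustration of the bidirectional variant, this sketch runs an LSTM over a batch of random sequences (the feature and hidden sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Bidirectional LSTM over a batch of 3 sequences, each 10 steps of 8 features
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True,
               bidirectional=True)
x = torch.randn(3, 10, 8)
output, (h_n, c_n) = lstm(x)
# output has shape (3, 10, 32): forward and backward hidden states
# are concatenated at every time step
```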

From Classical Machine Learning to Agentic LLMs

1 Why This Matters

Why modern AI is a systems problem, not just a modeling problem

Core Motivation:
  • Beyond Accuracy: Real-world AI must reason, adapt, and interact with tools
  • System-Level Thinking: Performance depends on architecture, data, and evaluation
  • Reliability: AI must behave robustly under uncertainty and distribution shifts
  • Alignment: Systems must remain controllable and human-centered

2 Classical Machine Learning Still Matters

Why foundational ML remains essential in LLM-centric systems

Where Classical ML Fits:
  • Structured Signals: Tabular, temporal, and metadata-driven learning
  • Baselines: Strong reference points for evaluating LLM gains
  • Monitoring: Drift detection, anomaly detection, and calibration
  • Efficiency: Lightweight models for cost-sensitive components
Classical ML Pipeline: Raw Data → Preprocessing → Model Training (Supervised / Unsupervised) → Evaluation & Deployment

3 Multimodal Learning: Beyond Text

Aligning multiple modalities for unified reasoning

Key Concepts:
  • Representation Alignment: Mapping text, vision, and signals into shared spaces
  • Cross-Modal Reasoning: Using one modality to disambiguate another
  • Vision-Language Models: Joint perception and language understanding
  • Real-World Inputs: Video, sensors, and structured context
Multimodal Representation Alignment: Text, Image, Video, and Structured Data → Shared Embedding (aligned representations) → Reasoning

4 Large Language Models as System Components

Treating LLMs as interfaces, planners, and controllers

Practical Building Blocks:
  • Instruction Tuning: Aligning model behavior with task intent
  • Post-Training: PPO, DPO, GRPO for preference alignment
  • PEFT: Efficient adaptation without full retraining
  • RAG: Grounding generation in external knowledge
LLM-Centric System Architecture: User Query → Retriever (RAG, over Docs / KB) → LLM (instruction-tuned + post-trained) → Tools / APIs (Search, Code, DB) → Response + Logging + Evaluation
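The retrieval step of this architecture can be sketched in a few lines. The example below scores documents by keyword overlap and stubs out the generator; a real system would use dense embeddings and an actual LLM call, and the documents here are invented for illustration:

```python
import re

docs = [
    "PPO and DPO are post-training methods for preference alignment",
    "RAG grounds generation in external knowledge retrieved at query time",
    "PEFT adapts large models efficiently without full retraining",
]

def tokenize_words(text):
    # Lowercased word set, ignoring punctuation
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, docs, k=1):
    # Rank documents by keyword overlap with the query
    q_words = tokenize_words(query)
    return sorted(docs, key=lambda d: len(q_words & tokenize_words(d)),
                  reverse=True)[:k]

def generate(query, context):
    # Stub: a real LLM would condition its answer on the retrieved context
    return f"Answer to '{query}' grounded in: {context[0]}"

query = "How does RAG ground generation?"
answer = generate(query, retrieve(query, docs))
```

The key design point survives even in this toy form: the generator never answers from parametric memory alone; it is always handed retrieved context.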

5 Reasoning and Agentic Behavior

Designing AI systems that plan, act, and adapt

Agent Design Elements:
  • Planning: Multi-step decision making over long horizons
  • Tool Use: APIs, search, code execution, and external actions
  • Memory: Persistent and episodic context
  • Control: Constraining behavior through policies and rewards
Agent Loop: Observe → Reason → Act (Tools) → Evaluate → Update Memory → repeat
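The loop above can be written down as a small control skeleton. Every component here is a toy stand-in: a real agent would call an LLM in the "reason" step and external tools or APIs in the "act" step:

```python
def run_agent(task, tools, max_steps=10):
    memory = []
    for step in range(max_steps):
        observation = tools["observe"](task, memory)     # Observe
        action = tools["reason"](observation)            # Reason
        result = tools["act"](action)                    # Act (tools)
        memory.append({"step": step, "result": result})  # Update memory
        if tools["evaluate"](result, task):              # Evaluate
            break
    return memory

# Toy task: count up from 0 until the target value is reached
toy_tools = {
    "observe": lambda task, memory: memory[-1]["result"] if memory else 0,
    "reason": lambda obs: obs + 1,
    "act": lambda action: action,
    "evaluate": lambda result, task: result >= task,
}
history = run_agent(task=3, tools=toy_tools)
```

The memory list doubles as the episodic context from the design elements above: each step's reasoning observes the results of earlier steps.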

6 Safety, Robustness, and Evaluation

Ensuring trustworthy behavior in autonomous systems

Evaluation Focus:
  • Failure Modes: Identifying where and why systems break
  • Distribution Shift: Robustness beyond training data
  • Adversarial Inputs: Stress-testing unsafe behavior
  • Human-in-the-Loop: Oversight and corrective feedback
Safety-Aware Evaluation Loop: Model Output → Risk / Policy Scoring (safety checks) → Review → Feedback → Improve

7 How Everything Connects

Viewing AI as an integrated system

System Interactions:
  • Classical ML: Grounding and structure
  • Multimodal Models: Rich perception
  • LLMs: Language and reasoning
  • Agents: Long-horizon task execution
  • Evaluation: Trust and reliability
End-to-End AI System View: Data → ML Models → LLM → Agent → Tools / APIs → Evaluation & Safety

8 Final Thoughts

Where modern AI research is heading

Closing Reflections:
  • Bigger is Not Enough: Scaling alone does not ensure reliability
  • Systems Matter: Architecture, data, and evaluation define success
  • Agentic AI: The future lies in planning, memory, and control
  • Responsible Design: Safety and alignment are non-negotiable

Federated Learning

1 What is Federated Learning?

Distributed machine learning approach that preserves privacy

Key Principles:
  • Data Privacy: Data never leaves the device
  • Decentralized Training: Models trained locally on devices
  • Aggregation: Central server aggregates model updates
  • Communication Efficiency: Minimize data transfer

2 Federated Averaging Algorithm

The core algorithm for federated learning

import torch

def federated_averaging(global_model, client_models, client_weights):
    """
    Aggregate client models using weighted averaging.
    
    Args:
        global_model: The current global model
        client_models: List of updated client models
        client_weights: Weight for each client (usually proportional to its
            data size; weights are assumed to sum to 1)
    """
    
    # Initialize aggregated parameters
    aggregated_params = {}
    
    for name, param in global_model.named_parameters():
        aggregated_params[name] = torch.zeros_like(param.data)
    
    # Weighted averaging
    for client_model, weight in zip(client_models, client_weights):
        for name, param in client_model.named_parameters():
            aggregated_params[name] += weight * param.data
    
    # Update global model
    for name, param in global_model.named_parameters():
        param.data = aggregated_params[name]
    
    return global_model
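A quick way to sanity-check the aggregation is to average two clients whose parameters are set to known constants. The demo below repeats the aggregation logic in compact form so it stands alone; the client parameter values and equal weights are chosen purely so the result is easy to verify by hand:

```python
import copy
import torch
import torch.nn as nn

def federated_averaging(global_model, client_models, client_weights):
    # Weighted average of client parameters (weights assumed to sum to 1)
    aggregated = {name: torch.zeros_like(p.data)
                  for name, p in global_model.named_parameters()}
    for client, weight in zip(client_models, client_weights):
        for name, param in client.named_parameters():
            aggregated[name] += weight * param.data
    for name, param in global_model.named_parameters():
        param.data = aggregated[name]
    return global_model

# Two clients whose parameters are set to known constants
global_model = nn.Linear(2, 1)
clients = [copy.deepcopy(global_model) for _ in range(2)]
with torch.no_grad():
    clients[0].weight.fill_(1.0)
    clients[0].bias.fill_(0.0)
    clients[1].weight.fill_(3.0)
    clients[1].bias.fill_(0.0)

global_model = federated_averaging(global_model, clients, [0.5, 0.5])
# Equal weights: every weight entry becomes (1.0 + 3.0) / 2 = 2.0
```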

3 Challenges in Federated Learning

Understanding the limitations and ongoing research areas

Main Challenges:
  • Non-IID Data: Data distribution varies across clients
  • Communication Costs: Limited bandwidth and connectivity
  • System Heterogeneity: Different device capabilities
  • Privacy Attacks: Model inversion and membership inference
  • Byzantine Failures: Malicious or faulty clients