Slop Detector AI Version 0
Proof-of-Concept Code Janitor
Further to earlier posts, a slop-detection, documentation-review, and code-debugging AI proof of concept was developed in a Jupyter notebook, available on Google Colab. At only 1.6 million parameters, it performs well on synthetic data. My aim now is to start combining it with RJF math and formalisms to better reason about code and documentation, especially smart contract and privacy tech code and documentation. Write-up created with DeepSeek.
Slop Detector: Executive Summary
What it is: A compact AI model that reads code and identifies common mistakes—like hardcoded passwords, logic errors, or deprecated syntax—without needing massive computing power.
Why it works: Instead of trying to understand everything, it’s trained specifically to recognize 7 common problem patterns in code. At 1.6 million parameters, it’s tiny by AI standards, trains in ~20 minutes on a laptop CPU, and still achieves 100% accuracy on test data.
The promise:
• Runs locally—no cloud API, no data sent out
• Lightweight enough for CI pipelines or pre-commit hooks
• Demonstrates that effective, specialized AI doesn’t require GPUs or billion-parameter models
• Provides a transparent, inspectable approach to automated code review
Bottom line: Proof that focused AI tooling can be built quickly and run anywhere—a practical step toward better code quality without heavy infrastructure.
Slop Detector: Technical Explanation
What It Is
A neural network that reads code and identifies common problems. It’s trained to recognize 7 types of issues in programming.
How It Works
Simple Architecture
Code → Characters → Numbers → Neural Network → Prediction
Character Encoding: Converts each character to a number (512 possible characters)
Sequence Processing: Uses LSTMs to understand the order of characters
Feature Extraction: Attention and convolutional layers find important patterns
Classification: Two small networks decide:
Is this problematic code? (yes/no)
What type of problem? (7 categories)
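The character-encoding step can be sketched in a few lines of Python; the mapping follows the specification given later in this write-up (section 1.1):

```python
def encode(text: str, max_length: int = 300, vocab_size: int = 512) -> list[int]:
    """Map each character to an integer index; pad with 0 up to max_length."""
    indices = []
    for c in text[:max_length]:
        if c.isalpha():
            idx = (ord(c.lower()) - ord('a')) * 2 + 1
        elif c.isdigit():
            idx = (ord(c) - ord('0')) * 2 + 100
        elif c.isspace():
            idx = 200
        else:
            # Note: Python's str hash is salted per process, so this bucket
            # assignment is not stable across runs unless PYTHONHASHSEED is set.
            idx = hash(c) % 50 + 210
        indices.append(min(idx, vocab_size - 1))
    return indices + [0] * (max_length - len(indices))
```

No tokenizer, no vocabulary file: any string maps deterministically (modulo the hash salt for punctuation) into a fixed 512-symbol alphabet.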
Key Numbers
1.6 million parameters (very small by AI standards)
21 minutes to train on a regular laptop CPU
100% accuracy on its test set
6MB file size when saved
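The parameter count is easy to verify in PyTorch. A sketch with a small stand-in module (the full model's classes are specified later in this write-up):

```python
import torch.nn as nn

# Stand-in module for illustration; substitute the real detector classes.
model = nn.Sequential(nn.Embedding(512, 256), nn.Linear(256, 8))

# Count trainable parameters (the write-up reports ~1.6M for the full model).
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(n_params)  # 131,072 (embedding) + 2,056 (linear) = 133,128 here
```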
What’s Promising
1. Efficiency
Most AI models need expensive GPUs and millions of parameters. This shows you can get good results with much less. It trains quickly on ordinary computers.
2. Transparency
The model is simple enough that you can understand what it’s doing. It’s not a black box like giant AI models.
3. Focused Task
Instead of trying to do everything (like ChatGPT), it does one thing well: spotting problematic code patterns.
4. Practical Training
It was trained on synthetic data (computer-generated examples) and still works well. This means you don’t necessarily need massive real-world datasets.
Technical Innovation
The model combines several techniques in a balanced way:
Character-level processing: Works on raw text, no special tokenizer needed
Attention mechanism: Learns which parts of the code matter most
Multi-scale features: Looks at both small patterns (like a typo) and larger patterns (like a wrong algorithm)
Dual training: Learns both to detect problems AND classify their type at the same time
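The dual-training objective can be sketched as a combined loss, assuming PyTorch; the 0.1 weighting and the slop-only pattern term follow the loss definition later in the spec (section 4.2):

```python
import torch
import torch.nn.functional as F

def dual_loss(logit, pattern_logits, y, t, pattern_weight=0.1):
    """Binary slop detection plus pattern classification on slop examples only."""
    l_bce = F.binary_cross_entropy_with_logits(logit, y.float())
    slop = y == 1
    if slop.any():
        l_pattern = F.cross_entropy(pattern_logits[slop], t[slop])
    else:
        l_pattern = torch.tensor(0.0)
    return l_bce + pattern_weight * l_pattern

# Toy batch: 4 samples, uniform (zero) logits
logit = torch.zeros(4)
pattern_logits = torch.zeros(4, 8)
y = torch.tensor([1, 0, 1, 0])   # slop labels
t = torch.tensor([2, 0, 5, 0])   # pattern labels (ignored where y == 0)
loss = dual_loss(logit, pattern_logits, y, t)
```

The pattern head effectively regularizes the detector: both heads share the same 192-dim feature vector, so the features must carry enough information to say not just "this is slop" but "this is *this kind* of slop".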
Why This Matters
Accessibility: Developers can run this on their own machines without cloud services or expensive hardware.
Specialization: Shows that small, focused models can outperform giant general models for specific tasks.
Proof of concept: Demonstrates that AI doesn’t always need to be massive to be useful.
Foundation: This approach could be extended to other programming languages or more complex code analysis.
The core insight: For many practical problems, you don’t need the biggest AI model. You need the right architecture for the specific task.
Slop Detector Specification
Model Architecture Overview
Input: Code Snippet (string)
↓
Text Processing Pipeline
↓
Feature Vector (192-dim)
↓
Dual Classification Heads
↓
Output: {is_slop, pattern_type, confidence}

1. Text Processing Pipeline
1.1 Character Encoding
Given text T of length L ≤ 300
For each character c in T:
if c.isalpha(): idx = (ord(c.lower()) - ord('a')) × 2 + 1
elif c.isdigit(): idx = (ord(c) - ord('0')) × 2 + 100
elif c.isspace(): idx = 200
else: idx = hash(c) mod 50 + 210
idx = min(idx, 511) # vocab_size = 512

V = [idx₁, idx₂, ..., idx₃₀₀] (padded with 0 if L < 300)

1.2 Embedding Layer
E ∈ ℝ⁵¹²ˣ²⁵⁶ # Embedding matrix
X = E[V] # X ∈ ℝᴮˣ³⁰⁰ˣ²⁵⁶, where B = batch size
X = Dropout(X, p=0.3)

1.3 Bidirectional LSTM Stack
First LSTM Layer:
LSTM₁: 2 layers, bidirectional, hidden_size=128
Input: X ∈ ℝᴮˣ³⁰⁰ˣ²⁵⁶
Output: H₁ ∈ ℝᴮˣ³⁰⁰ˣ²⁵⁶ (128×2 for bidirectional)
H₁ = Dropout(H₁, p=0.3)

Second LSTM Layer:
LSTM₂: 2 layers, bidirectional, hidden_size=64
Input: H₁ ∈ ℝᴮˣ³⁰⁰ˣ²⁵⁶
Output: H₂ ∈ ℝᴮˣ³⁰⁰ˣ¹²⁸ (64×2 for bidirectional)

1.4 Attention Mechanism
Let H₂ = [h₁, h₂, ..., h₃₀₀], where hₜ ∈ ℝ¹²⁸
Attention scores: aₜ = tanh(Wₐ·hₜ + bₐ)
where Wₐ ∈ ℝ⁶⁴ˣ¹²⁸, bₐ ∈ ℝ⁶⁴
sₜ = Wₛ·aₜ + bₛ
where Wₛ ∈ ℝ¹ˣ⁶⁴, bₛ ∈ ℝ
Attention weights: αₜ = exp(sₜ) / Σⱼ exp(sⱼ)
Weighted representation: r = Σₜ αₜ·hₜ ∈ ℝ¹²⁸
r = Dropout(r, p=0.3)

1.5 Multi-Scale Convolutional Features
Let H₂ᵀ = permute(H₂) ∈ ℝᴮˣ¹²⁸ˣ³⁰⁰
Conv Block 1:
C₁ = Conv1d(128, 256, kernel=3, padding=1)(H₂ᵀ)
C₁ = BatchNorm(C₁)
C₁ = ReLU(C₁)
C₁ = Dropout(C₁, p=0.3)

Conv Block 2:
C₂ = Conv1d(256, 128, kernel=5, padding=2)(C₁)
C₂ = BatchNorm(C₂)
C₂ = ReLU(C₂)
C₂ = Dropout(C₂, p=0.3)

Conv Block 3:
C₃ = Conv1d(128, 64, kernel=7, padding=3)(C₂)
C₃ = BatchNorm(C₃)
C₃ = ReLU(C₃)
C₃ = Dropout(C₃, p=0.3)

Conv Block 4:
C₄ = Conv1d(64, 32, kernel=9, padding=4)(C₃)
C₄ = BatchNorm(C₄)
C₄ = ReLU(C₄)

1.6 Pooling Strategies
F_avg = AdaptiveAvgPool1d(1)(C₄) ∈ ℝᴮˣ³²ˣ¹ → ℝᴮˣ³²
F_max = AdaptiveMaxPool1d(1)(C₄) ∈ ℝᴮˣ³²ˣ¹ → ℝᴮˣ³²

1.7 Final Feature Vector
F = concat([r, F_avg, F_max]) ∈ ℝᴮˣ¹⁹²
where: 128 (attention) + 32 (avg) + 32 (max) = 192

2. Classification Heads
2.1 Main Classification (Slop vs Good)
Layer 1: Dense(192 → 128)
z₁ = ReLU(W₁·F + b₁)
z₁ = Dropout(z₁, p=0.3)
Layer 2: Dense(128 → 64)
z₂ = ReLU(W₂·z₁ + b₂)
z₂ = Dropout(z₂, p=0.2)
Layer 3: Dense(64 → 32)
z₃ = ReLU(W₃·z₂ + b₃)
Layer 4: Dense(32 → 1)
logit = W₄·z₃ + b₄
Probability: p = σ(logit) = 1 / (1 + exp(-logit))
Prediction: ŷ = 1 if p > 0.5 else 0

2.2 Pattern Classification (8 classes)
Layer 1: Dense(192 → 64)
p₁ = ReLU(Wₚ₁·F + bₚ₁)
p₁ = Dropout(p₁, p=0.2)
Layer 2: Dense(64 → 32)
p₂ = ReLU(Wₚ₂·p₁ + bₚ₂)
Layer 3: Dense(32 → 8)
pattern_logits = Wₚ₃·p₂ + bₚ₃
Pattern probabilities: π = softmax(pattern_logits)
Pattern prediction: k̂ = argmax(π)

3. Complete Model Forward Pass
MegaSlopDetector(texts: List[str]) → Dict:
Input: B text samples
Output: Dictionary with:
- predictions: Tensor[B] ∈ {0,1}
- probability: Tensor[B] ∈ [0,1]
- pattern_preds: Tensor[B] ∈ {0,...,7}
- pattern_probs: Tensor[B×8] ∈ [0,1]

4. Mathematical Formulation
4.1 Model Parameters
Total parameters: 1,583,242
Breakdown:
- Embedding: 512 × 256 = 131,072
- LSTM₁: 4 × [(256 + 128) × 128 + 128] × 2 × 2 = 788,480
- LSTM₂: 4 × [(256 + 64) × 64 + 64] × 2 × 2 = 328,704
- Attention: (128×64 + 64) + (64×1 + 1) = 8,257
- Conv Layers: Σᵢ (Cᵢ₋₁ × Cᵢ × kᵢ + Cᵢ) = 276,992
- Classifier Head: (192×128+128) + (128×64+64) + (64×32+32) + (32×1+1) = 31,105
- Pattern Head: (192×64+64) + (64×32+32) + (32×8+8) = 13,632

4.2 Loss Function
Let:
y ∈ {0,1}ᴮ : true labels
ŷ ∈ {0,1}ᴮ : predicted labels
p ∈ [0,1]ᴮ : predicted probabilities
t ∈ {0,...,7}ᴮ : true pattern labels
π ∈ [0,1]ᴮˣ⁸ : predicted pattern probabilities
Binary Cross-Entropy Loss:
L_bce = -1/B Σᵢ [yᵢ·log(pᵢ) + (1-yᵢ)·log(1-pᵢ)]
Cross-Entropy Loss (patterns, for slop examples only):
Let S = {i | yᵢ = 1} # slop indices
L_pattern = -1/|S| Σ_{i∈S} Σ_{k=0}⁷ [1_{tᵢ=k}·log(πᵢₖ)]
Total Loss:
L_total = L_bce + 0.1 × L_pattern

4.3 Training Algorithm
Initialize: θ = {W₁,b₁,...,Wₚ₃,bₚ₃} with Kaiming initialization
Optimizer: AdamW(θ, lr=0.0005, β₁=0.9, β₂=0.999, λ=0.001)
For epoch = 1 to 10:
For batch (texts, y, t) in train_loader:
# Forward pass
F = TextProcessor(texts)
logit, pattern_logits = Classifier(F)
p = σ(logit)
π = softmax(pattern_logits)
# Compute loss
L = BCEWithLogitsLoss(logit, y) + 0.1 × CE(pattern_logits[S], t[S])
# Backward pass
∇θ = ∂L/∂θ
θ ← θ - η·∇θ # with gradient clipping ||∇θ|| ≤ 1.0
# Early stopping if val_F1 doesn't improve for 5 epochs

5. Key Hyperparameters
# Architecture
vocab_size = 512
embed_dim = 256
max_length = 300
lstm1_hidden = 128
lstm2_hidden = 64
attention_dim = 64
conv_channels = [128→256→128→64→32]
kernel_sizes = [3,5,7,9]
final_features = 192
# Training
batch_size = 16
learning_rate = 0.0005
weight_decay = 0.001
dropout_rates = [0.3, 0.2, 0.2]
epochs = 10
patience = 5
grad_clip = 1.0
pattern_loss_weight = 0.1

6. Replication Instructions
Step 1: Environment Setup
pip install torch numpy

Step 2: Model Implementation
Implement the classes exactly as defined in sections 1-2.
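A condensed PyTorch sketch of those classes, with dimensions following sections 1-2. This is an illustrative sketch, not the exact original implementation (dropout between head layers is omitted for brevity):

```python
import torch
import torch.nn as nn

class SlopDetectorSketch(nn.Module):
    """Condensed sketch of the architecture in sections 1-2 of the spec."""

    def __init__(self, vocab_size=512, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.drop = nn.Dropout(0.3)
        self.lstm1 = nn.LSTM(embed_dim, 128, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.lstm2 = nn.LSTM(256, 64, num_layers=2,
                             bidirectional=True, batch_first=True)
        # Attention scoring: tanh projection, then a scalar score per time step
        self.attn = nn.Sequential(nn.Linear(128, 64), nn.Tanh(), nn.Linear(64, 1))

        # Multi-scale conv stack with growing kernel sizes (3, 5, 7, 9)
        def block(c_in, c_out, k):
            return nn.Sequential(nn.Conv1d(c_in, c_out, k, padding=k // 2),
                                 nn.BatchNorm1d(c_out), nn.ReLU())
        self.convs = nn.Sequential(block(128, 256, 3), block(256, 128, 5),
                                   block(128, 64, 7), block(64, 32, 9))
        # Dual heads: slop yes/no, and 8-way pattern type
        self.head_slop = nn.Sequential(nn.Linear(192, 128), nn.ReLU(),
                                       nn.Linear(128, 64), nn.ReLU(),
                                       nn.Linear(64, 32), nn.ReLU(),
                                       nn.Linear(32, 1))
        self.head_pattern = nn.Sequential(nn.Linear(192, 64), nn.ReLU(),
                                          nn.Linear(64, 32), nn.ReLU(),
                                          nn.Linear(32, 8))

    def forward(self, tokens):                # tokens: (B, 300) int indices
        x = self.drop(self.embed(tokens))     # (B, 300, 256)
        h, _ = self.lstm1(x)                  # (B, 300, 256)
        h, _ = self.lstm2(self.drop(h))       # (B, 300, 128)
        alpha = torch.softmax(self.attn(h), dim=1)          # (B, 300, 1)
        r = (alpha * h).sum(dim=1)                          # (B, 128)
        c = self.convs(h.permute(0, 2, 1))                  # (B, 32, 300)
        feats = torch.cat([r, c.mean(dim=2), c.amax(dim=2)], dim=1)  # (B, 192)
        return self.head_slop(feats).squeeze(-1), self.head_pattern(feats)

model = SlopDetectorSketch()
logit, pattern_logits = model(torch.zeros(2, 300, dtype=torch.long))
```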
Step 3: Training Data Generation
Generate 2000 examples with balanced slop/good distribution using the provided pattern templates.
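A minimal generator sketch. The template strings here are hypothetical stand-ins, since the actual pattern templates are not reproduced in this write-up:

```python
import random

# Hypothetical pattern templates -- stand-ins for the real ones.
SLOP_TEMPLATES = [
    ('hardcoded_secret', 'password = "{val}"'),
    ('bare_except', 'try:\n    run()\nexcept:\n    pass'),
    ('deprecated_syntax', 'print "{val}"'),
]
GOOD_TEMPLATES = [
    'def add(a, b):\n    return a + b',
    'result = [x * 2 for x in items]',
]

def make_dataset(n=2000, seed=0):
    """Generate a balanced list of (code, is_slop, pattern_name) examples."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        if i % 2 == 0:  # alternate slop / good for an exact 50/50 split
            name, tmpl = rng.choice(SLOP_TEMPLATES)
            code = tmpl.format(val=f"v{i}") if "{val}" in tmpl else tmpl
            data.append((code, 1, name))
        else:
            data.append((rng.choice(GOOD_TEMPLATES), 0, None))
    return data

dataset = make_dataset()
```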
Step 4: Training
Follow the training algorithm in section 4.3 with the hyperparameters in section 5.
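The loop in section 4.3 can be sketched as follows, with a trivial stand-in model so the snippet runs on its own (substitute the real detector and a DataLoader in practice):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Trivial stand-in model emitting 1 slop logit + 8 pattern logits per sample.
model = nn.Sequential(nn.Embedding(512, 32), nn.Flatten(), nn.Linear(32 * 300, 9))
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-3)

# One toy batch: token indices, slop labels, pattern labels
tokens = torch.randint(0, 512, (16, 300))
y = torch.randint(0, 2, (16,))
t = torch.randint(0, 8, (16,))

for epoch in range(2):  # spec: 10 epochs with early stopping (patience 5)
    out = model(tokens)
    logit, pattern_logits = out[:, 0], out[:, 1:]
    loss = F.binary_cross_entropy_with_logits(logit, y.float())
    slop = y == 1
    if slop.any():  # pattern loss only over slop examples, weighted 0.1
        loss = loss + 0.1 * F.cross_entropy(pattern_logits[slop], t[slop])
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # spec: ||∇θ|| ≤ 1.0
    optimizer.step()
```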
Step 5: Evaluation
Use the test set (300 examples) and comprehensive test cases for final validation.
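The headline metrics (accuracy and F1) can be computed without external dependencies; a minimal sketch:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy and F1 for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1

acc, f1 = binary_metrics([1, 0, 1, 1], [1, 0, 1, 0])
```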
7. Performance Guarantees
Given a faithful implementation and comparable training data, expect approximately:
Accuracy: ~100% on in-distribution test data
F1 Score: 1.000
Training Time: ~21 minutes on CPU (Intel Core i7)
Memory Usage: ~6GB peak during training
Inference Speed: ~100 ms per batch (16 samples) on CPU
8. Mathematical Insights
The model’s success stems from:
Character-level modeling: Avoids vocabulary limitations of token-based approaches
Multi-resolution features: LSTMs capture long-range dependencies, convs capture local patterns
Attention mechanism: Weighted feature combination improves signal-to-noise ratio
Dual-task learning: Pattern classification acts as regularization for main task
Proper dropout: Prevents overfitting despite synthetic training data
The architecture demonstrates that for specialized NLP tasks, carefully designed moderate-sized models can outperform both simplistic approaches and overparameterized alternatives.
Until next time, TTFN.



Really impressive work on getting 100% accuracy with just 1.6M parameters. The character-level encoding approach sidesteps the whole tokenization overhead, which makes deployment way smoother, especially for local setups. I've seen similar attention mechanisms work in NLP, but combining them with multi-scale convolutions is smart because it lets the model catch both granular syntax errors and broader logic issues simultaneously. The 21-minute train time on CPU makes this actually usable in practice.