Jamie Software Lab
TypeScript Next.js Prisma Scraping AI

GlobeScraper

A full-stack content, community, and rental data platform for English teachers relocating to Southeast Asia. Built from scratch with Next.js 14, a 970-line Prisma schema, a 7-source scraping pipeline, and an AI content engine powered by Google Gemini.

Status Live in production
Started Feb 2026
Commits 259+
Deploy Vercel + Hetzner

By the numbers

Scale at a glance.

55 API routes
30+ Prisma models
7 Scraper sources
45+ CLI scripts
~35k Listings scraped
50+ React components
970 Schema lines
300+ GeoJSON districts

Features

What the platform actually does.

Rental marketplace
Search with filters (city, district, beds, type, price), pagination, image carousels, saved listings. Data from 6 active scraper sources across Cambodia.
Scraping pipeline
7 source adapters (Cheerio + Playwright for Cloudflare bypass). Parallel workers, atomic queue claiming, content fingerprinting, human-like pacing with jittered delays.
AI content engine
End-to-end article pipeline: competitor research via Serper.dev, gap analysis, Gemini 3 Flash generation, Imagen 4.0 images, auto-publish with SEO scoring.
Community
Public/private profiles, connections, DMs, meetups with RSVPs, trust panels, and a report/moderation system. Rate-limited via Upstash Redis.
Analytics heatmap
Interactive Leaflet map with 300+ Cambodia district boundaries. Daily/monthly price indices, KPI cards, trend charts, volatility analysis.
Email campaigns
9 block types, 5 template presets, AI content generation via Gemini, Resend integration with delivery tracking and scheduled cron delivery.
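
The content fingerprinting mentioned under Scraping pipeline can be illustrated with a minimal sketch. It assumes the fingerprint is a SHA-256 hash over a few normalized listing fields; the field names, the normalization rules, and the `fingerprint` and `hasChanged` helpers are all hypothetical and not taken from the project's schema.

```typescript
import { createHash } from "node:crypto";

// Hypothetical listing shape: these field names are assumptions for illustration.
interface ListingContent {
  title: string;
  description: string;
  priceUsd: number | null;
  bedrooms: number | null;
  district: string | null;
}

// Collapse whitespace and case so cosmetic edits do not change the hash.
function normalize(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, " ");
}

// Build a stable SHA-256 fingerprint over the fields that matter for change detection.
function fingerprint(listing: ListingContent): string {
  const parts = [
    normalize(listing.title),
    normalize(listing.description),
    String(listing.priceUsd ?? ""),
    String(listing.bedrooms ?? ""),
    normalize(listing.district ?? ""),
  ];
  // JSON.stringify keeps field boundaries unambiguous ("a","bc" differs from "ab","c").
  return createHash("sha256").update(JSON.stringify(parts)).digest("hex");
}

// True when the incoming content differs from the stored fingerprint (or none is stored).
function hasChanged(storedFingerprint: string | null, incoming: ListingContent): boolean {
  return storedFingerprint !== fingerprint(incoming);
}
```

With a check like this, an upsert can skip rewriting a listing whose content is unchanged since the previous scrape and update only rows whose fingerprint differs.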

Rental pipeline

How data flows from source sites to the marketplace.

Discover
Crawl category pages from 6 sources, extract listing URLs, enqueue in ScrapeQueue.
Process
Atomic claiming, fetch + parse, upsert with content fingerprinting, geocoded titles.
AI review
Gemini classifies residential vs non-residential, corrects types, rewrites descriptions.
Index
Daily aggregation: median, mean, p25, p75 by city, district, beds, and type.
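
The Index step can be sketched as a group-and-summarize pass. The source does not say whether the aggregation runs in SQL or in application code, nor which percentile method it uses; the version below assumes in-memory rows and linear interpolation between neighbouring values. `ListingRow`, `summarize`, and `buildDailyIndex` are hypothetical names.

```typescript
// Hypothetical row shape: field names are assumptions for illustration.
interface ListingRow {
  city: string;
  district: string;
  bedrooms: number;
  type: string;
  priceUsd: number;
}

interface PriceStats {
  count: number;
  mean: number;
  p25: number;
  median: number;
  p75: number;
}

// Linear-interpolated percentile over an ascending-sorted, non-empty array (p in [0, 1]).
function percentile(sorted: number[], p: number): number {
  const pos = (sorted.length - 1) * p;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Summarize one bucket of prices; returns null for an empty bucket.
function summarize(prices: number[]): PriceStats | null {
  if (prices.length === 0) return null;
  const sorted = prices.slice().sort((a, b) => a - b);
  const mean = sorted.reduce((sum, x) => sum + x, 0) / sorted.length;
  return {
    count: sorted.length,
    mean,
    p25: percentile(sorted, 0.25),
    median: percentile(sorted, 0.5),
    p75: percentile(sorted, 0.75),
  };
}

// Group rows by city, district, beds, and type, then summarize each group.
function buildDailyIndex(rows: ListingRow[]): Map<string, PriceStats> {
  const buckets = new Map<string, number[]>();
  for (const row of rows) {
    const key = [row.city, row.district, row.bedrooms, row.type].join("|");
    const bucket = buckets.get(key) ?? [];
    bucket.push(row.priceUsd);
    buckets.set(key, bucket);
  }
  const index = new Map<string, PriceStats>();
  buckets.forEach((prices, key) => {
    const stats = summarize(prices);
    if (stats) index.set(key, stats);
  });
  return index;
}
```

Reporting p25 and p75 alongside the median makes the index less sensitive to a handful of outlier listings than the mean alone.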

Stack

What powers each layer.

Framework
Next.js 14 (App Router)
Language
TypeScript 5.5
Database
MySQL + Prisma 5.18
Auth
Auth.js v5 (NextAuth)
AI
Gemini 3 Flash + Imagen 4.0
Scraping
Cheerio + Playwright
Rate limiting
Upstash Redis
Email
Resend + Vercel Cron
Maps
Leaflet + GeoJSON
Styling
Vanilla CSS (BEM)
Deploy
Vercel + Hetzner VPS
Testing
Vitest + Playwright

Decisions

Key architectural choices and why.

No Tailwind
All styling uses vanilla CSS with custom properties and BEM naming. Full control, no dependency bloat, easy to debug in DevTools.
Playwright for Cloudflare
Khmer24 blocks HTTP scrapers with Cloudflare WAF. Playwright with headless Chromium bypasses it. Other sources use lightweight Cheerio.
Human-like pacing
Jittered delays (1.2–2s), random breathers every 40–70 pages, night idle simulation, and a skip probability together reduce the risk of detection and bans.
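
A minimal sketch of this pacing, assuming uniform jitter and a fixed-length breather. The 1.2–2s delay and the 40–70 page interval come from the text above; the breather length, the skip probability value, and the `Pacer` class itself are illustrative assumptions. Night idle simulation is left out for brevity.

```typescript
// Returns a float in [0, 1), like Math.random. Injectable so the pacer can be tested.
type Rng = () => number;

interface PacerOptions {
  minDelayMs: number;       // lower bound of the per-page delay
  maxDelayMs: number;       // upper bound of the per-page delay
  breatherEveryMin: number; // take a breather after at least this many pages
  breatherEveryMax: number; // ...and at most this many
  breatherMs: number;       // assumed breather length (not stated in the source)
  skipProbability: number;  // assumed chance of skipping a page (not stated in the source)
}

const DEFAULTS: PacerOptions = {
  minDelayMs: 1200,
  maxDelayMs: 2000,
  breatherEveryMin: 40,
  breatherEveryMax: 70,
  breatherMs: 60000,
  skipProbability: 0.05,
};

function randomBetween(min: number, max: number, rng: Rng): number {
  return min + rng() * (max - min);
}

class Pacer {
  private readonly opts: PacerOptions;
  private readonly rng: Rng;
  private pagesUntilBreather: number;

  constructor(opts: PacerOptions = DEFAULTS, rng: Rng = Math.random) {
    this.opts = opts;
    this.rng = rng;
    this.pagesUntilBreather = this.nextBreatherGap();
  }

  // Pick a whole number of pages in [breatherEveryMin, breatherEveryMax].
  private nextBreatherGap(): number {
    const { breatherEveryMin, breatherEveryMax } = this.opts;
    return Math.floor(randomBetween(breatherEveryMin, breatherEveryMax + 1, this.rng));
  }

  // Decide whether to skip the next page entirely.
  shouldSkip(): boolean {
    return this.rng() < this.opts.skipProbability;
  }

  // Jittered delay for the page just fetched, plus a breather when one is due.
  nextDelayMs(): number {
    let delay = randomBetween(this.opts.minDelayMs, this.opts.maxDelayMs, this.rng);
    this.pagesUntilBreather -= 1;
    if (this.pagesUntilBreather <= 0) {
      delay += this.opts.breatherMs;
      this.pagesUntilBreather = this.nextBreatherGap();
    }
    return delay;
  }

  // Sleep for the computed delay.
  wait(): Promise<void> {
    const ms = this.nextDelayMs();
    return new Promise<void>((resolve) => setTimeout(resolve, ms));
  }
}
```

A scrape loop would call `shouldSkip()` before each page and `await pacer.wait()` after each fetch, so request timing never settles into a fixed rhythm.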
Atomic queue claiming
Parallel workers claim batches via SQL UPDATE…LIMIT. No coordinator process, no locking conflicts, works with N workers.
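
One way such a claim step could look, sketched under assumptions: the table and column names (`ScrapeQueue`, `status`, `claimToken`, `claimedAt`) and the `SqlClient` interface are hypothetical stand-ins for whatever database client the project uses. With Prisma, the two calls could map onto `$executeRaw` and `$queryRaw`.

```typescript
// Hypothetical minimal database client; not a real library interface.
interface SqlClient {
  // Runs a write statement and resolves with the number of affected rows.
  execute(sql: string, params: unknown[]): Promise<number>;
  // Runs a read statement and resolves with the matching rows.
  query<T>(sql: string, params: unknown[]): Promise<T[]>;
}

interface QueueRow {
  id: number;
  url: string;
}

// Claim up to batchSize pending rows for this worker in a single UPDATE statement.
async function claimBatch(
  db: SqlClient,
  workerId: string,
  batchSize: number,
): Promise<QueueRow[]> {
  // A per-claim token lets the worker read back exactly the rows it just claimed.
  const claimToken = `${workerId}:${Date.now()}:${Math.random().toString(36).slice(2)}`;

  // MySQL allows ORDER BY and LIMIT on a single-table UPDATE. Because the
  // statement both selects and marks the rows, two workers cannot claim the same row.
  const claimed = await db.execute(
    "UPDATE ScrapeQueue " +
      "SET status = 'PROCESSING', claimToken = ?, claimedAt = NOW() " +
      "WHERE status = 'PENDING' ORDER BY id LIMIT ?",
    [claimToken, batchSize],
  );
  if (claimed === 0) return []; // queue is empty or fully claimed

  return db.query<QueueRow>(
    "SELECT id, url FROM ScrapeQueue WHERE claimToken = ?",
    [claimToken],
  );
}
```

Each worker simply loops on `claimBatch` until it returns an empty array, which is why no coordinator process is needed.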
Gemini over GPT
Google Gemini 3 Flash is fast, cheap, and handles structured JSON output well. Used for both article generation and listing classification.
Hetzner for scrapers
Scraper scripts need Playwright (browser automation), which can't run on Vercel serverless. The CX23 VPS runs daily and weekly scrapes on a schedule.