A Python/Scrapy pipeline that collects local business leads, cleans and deduplicates contact data, scores outreach priority, and syncs actionable leads into ClickUp.
This project is a Python-based lead generation pipeline built to collect, clean, deduplicate, score, and organise local business leads around Sudbury and the CO10 area.
I built it to reduce the amount of manual work involved in finding local businesses, checking whether they had usable contact information, sorting duplicates across different directory sites, and deciding which leads were worth contacting first.
The system uses Scrapy spiders to collect business listings from multiple public directory sources. Some pages require browser rendering, so the project also uses Playwright through Scrapy where needed. The raw data is then normalised into a consistent CSV structure with fields such as business name, address, postcode, phone number, email, website, social links, category, source, and detail URL.
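The normalised row format described here can be sketched as a small dataclass plus a CSV writer. This is a minimal illustration, not the project's actual item definition: the field names follow the list above, and the helper name is hypothetical.

```python
import csv
from dataclasses import asdict, dataclass, fields


@dataclass
class Lead:
    """One normalised business record; fields follow the CSV schema described above."""
    business_name: str = ""
    address: str = ""
    postcode: str = ""
    phone: str = ""
    email: str = ""
    website: str = ""
    social_links: str = ""
    category: str = ""
    source: str = ""
    detail_url: str = ""


def write_leads_csv(path: str, leads: list[Lead]) -> None:
    """Write leads to CSV with one consistent header, whichever spider produced them."""
    header = [f.name for f in fields(Lead)]
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=header)
        writer.writeheader()
        writer.writerows(asdict(lead) for lead in leads)
```

Keeping every spider's output behind one schema like this is what lets the later merge, dedupe, and scoring steps treat all sources identically.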
The pipeline does more than just scrape data. It standardises messy records, merges duplicate leads, extracts useful outreach signals, scores each lead, recommends a primary outreach channel, and exports a top-priority list for action. It can also sync scored leads into ClickUp so outreach work can be managed as tasks rather than sitting in a spreadsheet.
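As a rough illustration of the scoring and channel-recommendation steps: the weights below are placeholder values, not the project's real ones, and the function name is hypothetical.

```python
def score_lead(lead: dict) -> tuple[int, str]:
    """Score a lead by available contact signals and suggest an outreach channel.

    Weights are illustrative placeholders, not the project's actual values.
    """
    score = 0
    if lead.get("email"):
        score += 3
    if lead.get("phone"):
        score += 2
    if lead.get("website"):
        score += 2
    if lead.get("social_links"):
        score += 1
    # Recommend the richest channel that actually exists for this lead.
    if lead.get("email"):
        channel = "email"
    elif lead.get("phone"):
        channel = "phone"
    elif lead.get("website"):
        channel = "website-form"
    else:
        channel = "research-needed"
    return score, channel
```

Sorting by this score and taking the head of the list is all a "top-50 outreach CSV" step needs.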
This is not a full web application yet. There is no frontend, dashboard, database, auth system, or internal API in the current codebase. The value of the project is in the data pipeline: turning scattered public business listings into a cleaner, ranked, outreach-ready lead list.
I built the Scrapy-based scraping pipeline, created multiple source-specific spiders, configured Playwright support for JavaScript-rendered pages, and defined the lead item schema used across the project.
I also created the post-processing scripts for standardising CSV outputs, merging and deduplicating records, enriching categories, scoring leads, filtering outreach-ready records, running basic QA counts, and syncing leads into ClickUp.
The project produced concrete lead generation outputs including a master lead CSV, scored leads CSV, top-50 outreach CSV, and ClickUp task mapping.
One of the main challenges was dealing with messy public directory data. Each source had its own structure, field quality, missing values, duplicated records, and inconsistent contact details, so the pipeline needed a standard format before the data could become useful.
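A minimal sketch of that kind of standardisation, assuming the field names used earlier; the specific normalisation rules shown are illustrative, not the project's exact ones.

```python
import re


def normalise_record(raw: dict) -> dict:
    """Coerce one source-specific record into a shared shape (illustrative rules)."""
    # Strip phone formatting down to digits (keeping a leading +).
    phone = re.sub(r"[^\d+]", "", raw.get("phone", ""))
    return {
        # Collapse runs of whitespace in the name.
        "business_name": " ".join(raw.get("business_name", "").split()),
        "postcode": raw.get("postcode", "").upper().strip(),
        "email": raw.get("email", "").strip().lower(),
        "website": raw.get("website", "").strip().rstrip("/"),
        "phone": phone,
    }
```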
Another challenge was deduplication. The same business can appear across multiple directory sites with slightly different names, phone numbers, websites, or addresses. I had to use a mix of exact matching and fuzzy matching so the final lead list was cleaner without losing useful records.
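The exact-plus-fuzzy approach can be sketched with the standard library's `difflib`; the 0.8 threshold and the choice of matching keys here are assumptions for illustration, not the project's tuned values.

```python
from difflib import SequenceMatcher


def same_business(a: dict, b: dict, threshold: float = 0.8) -> bool:
    """Treat two records as duplicates on an exact contact match,
    or on a fuzzy name match within the same postcode."""
    # Exact signals first: a shared phone or email is a strong duplicate indicator.
    for key in ("phone", "email"):
        if a.get(key) and a.get(key) == b.get(key):
            return True
    # Fall back to fuzzy name similarity, constrained to the same postcode.
    if a.get("postcode") and a.get("postcode") == b.get("postcode"):
        ratio = SequenceMatcher(
            None,
            a.get("business_name", "").lower(),
            b.get("business_name", "").lower(),
        ).ratio()
        return ratio >= threshold
    return False


def dedupe(leads: list[dict]) -> list[dict]:
    """Keep the first occurrence of each business (first-seen record wins)."""
    kept: list[dict] = []
    for lead in leads:
        if not any(same_business(lead, existing) for existing in kept):
            kept.append(lead)
    return kept
```

Constraining the fuzzy match to one postcode is what keeps similar names in different towns from being merged away.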
The project also had to move beyond scraping for its own sake. Raw leads are not enough. The pipeline needed to help decide which leads were worth contacting first, which channel made the most sense, and how the data could move into a real outreach workflow through ClickUp.
A limitation is that this is currently a CLI/data pipeline rather than a full web application. There is no dashboard, database, user account flow, or campaign management UI yet.
The pipeline reduces the manual effort needed to find and organise local business prospects. Instead of manually copying listings from directory sites, checking duplicates, and building outreach lists by hand, the system produces structured CSV outputs and a scored top-priority lead list.
It also creates a more practical outreach workflow by pushing leads into ClickUp, making it easier to track follow-up activity, priority, contact channel, and lead status.
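A sketch of the lead-to-task mapping behind that sync. The field mapping, task-name format, and score thresholds are assumptions; the actual create call would go to ClickUp's `POST /api/v2/list/{list_id}/task` endpoint with an API token, via any HTTP client.

```python
def lead_to_clickup_task(lead: dict) -> dict:
    """Map a scored lead onto a ClickUp task payload (field mapping is illustrative)."""
    score = lead.get("score", 0)
    # ClickUp priorities run 1 (urgent) to 4 (low); bucket the score onto that range.
    priority = 1 if score >= 6 else 2 if score >= 4 else 3
    return {
        "name": f'{lead.get("business_name", "Unknown")} [{lead.get("channel", "?")}]',
        "description": "\n".join(
            f"{key}: {lead.get(key, '')}" for key in ("phone", "email", "website", "source")
        ),
        "priority": priority,
        "tags": ["lead", lead.get("channel", "unknown")],
    }
```

Encoding the recommended channel in the task name and tags means the outreach decision made by the pipeline stays visible inside ClickUp itself.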
As a portfolio project, it shows my ability to build useful automation around real business needs: scraping, data processing, contact enrichment, deduplication, lead scoring, and workflow integration.
This project taught me that lead generation is mostly a data quality problem. Collecting leads is only the first step. The real work is cleaning, normalising, deduplicating, scoring, and turning the data into something that can actually be used.
I also learnt how brittle public-source scraping can be. Directory websites can change structure, block requests, or return inconsistent data, so the pipeline needs to be designed with clear source-specific logic and realistic expectations.
The project helped me think more practically about automation. A useful system does not always need to start as a full app. Sometimes the most valuable first version is a reliable pipeline that produces a clean output and connects to an existing workflow tool.
I help founders and teams turn messy ideas into reliable systems — from MVPs and APIs to AI-enabled automation workflows.