Customize OpenAI privacy-filter for Snowflake semantic_categories
Fine-tune OpenAI's privacy-filter model on Snowflake's semantic_categories taxonomy, then evaluate against a hand-labeled holdout.
Overview
OpenAI’s privacy-filter model is trained on generic PII categories. Snowflake’s semantic_categories use a richer, domain-specific taxonomy (e.g. US_PASSPORT, IBAN_CODE, HEALTHCARE_NUMBER).
Goal: Customize the privacy-filter so it can output Snowflake-aligned categories directly, then evaluate quality against a hand-labeled holdout set.
Success criteria:
- Macro-F1 ≥ 0.80 on the holdout across all Snowflake categories that have ≥ 20 labeled examples.
- Latency p95 ≤ 150 ms per document at evaluation time.
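The first success criterion can be made concrete with a small helper that computes macro-F1 only over categories with at least 20 gold examples. This is a sketch against an assumed evaluation format (parallel lists of per-span gold/predicted category labels); the real `opf eval` output format may differ.

```python
from collections import Counter

def macro_f1(gold, pred, min_support=20):
    """Macro-F1 over categories with >= min_support gold examples.

    gold, pred: parallel lists of category labels, one per entity span.
    Categories below the support threshold are excluded from the average,
    matching the success criterion above.
    """
    support = Counter(gold)
    cats = [c for c, n in support.items() if n >= min_support]
    f1s = []
    for c in cats:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0
```

Filtering by support before averaging keeps one rare, noisy category from sinking the headline number.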
Resources
- Introducing OpenAI Privacy Filter — announcement with architecture details (1.5B params, 50M active, 128k context)
- openai/privacy-filter on GitHub — Python CLI (`opf`) for redaction, evaluation, and fine-tuning
- openai/privacy-filter on Hugging Face — model weights (Apache 2.0), `AutoModelForTokenClassification` usage
- Community fine-tuned variants — existing fine-tunes, including quantized and domain-specific adaptations
- Snowflake `EXTRACT_SEMANTIC_CATEGORIES` — reference taxonomy (47 categories as of 8.x)
- ai4privacy/pii-masking-300k — OpenPII-220k (27 PII classes, 6 languages) + FinPII-80k (~20 finance/insurance classes); ~98.3% label accuracy
2026-05-06 — kickoff
- Pulled the full Snowflake `semantic_categories` reference list (47 categories as of Snowflake 8.x).
- Mapped each Snowflake category to the closest OpenAI privacy-filter output label where one exists. ~30% have no direct mapping — these are the interesting gap cases.
- Next: sample the gap categories from internal datasets and build a label schema for annotation.
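The mapping exercise above can be kept as a simple table in code, which also makes the gap cases queryable. This is an illustrative fragment only — the category names shown are a hypothetical subset, not the full 47-row mapping.

```python
# Hypothetical subset of the Snowflake -> privacy-filter mapping.
# None marks a gap case: no direct base-model label exists.
SNOWFLAKE_TO_OPF = {
    "EMAIL": "EMAIL",
    "NAME": "PERSON",           # assumed base label name
    "PHONE_NUMBER": "PHONE",    # assumed base label name
    "US_PASSPORT": None,        # gap
    "IBAN_CODE": None,          # gap
    "HEALTHCARE_NUMBER": None,  # gap
}

def gap_categories(mapping):
    """Categories with no direct base-model label -- the fine-tune targets."""
    return sorted(c for c, opf in mapping.items() if opf is None)
```

Keeping the gaps explicit as `None` (rather than omitting them) means the annotation-sampling step can be driven straight from this one file.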
2026-04-23 — initial exploration
Fine-tuning pipeline
- `opf train` ships natively — handles JSONL ingestion, 128-token banded attention windowing, AdamW with gradient accumulation, checkpoint serialization.
- Two demo workflows: policy adaptation (relabel existing categories) and new taxonomy (custom label space).
- Output head remapping: exact-match labels copy weights directly; new labels warm-start from the closest base class (e.g. `B-custom_id` inherits `B-ID` weights). This is why OpenAI's benchmark jumped from 54% → 96% F1 on small data.
- Custom taxonomies are configured via `label_space.json` with `span_class_names`. BIOES expansion is automatic; the background class `O` must be the first entry.
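The automatic BIOES expansion described above is straightforward to reason about with a few lines of Python. This is a sketch of the behavior as the notes describe it (O first, four positional prefixes per span class) — the exact ordering and file schema the real tool uses are assumptions.

```python
def expand_bioes(span_class_names):
    """Expand span class names into a BIOES tag set, background class O first.

    Mirrors the expansion the notes attribute to opf; B/I/E/S mark the
    beginning, inside, end, and single-token cases of a span.
    """
    labels = ["O"]
    for name in span_class_names:
        labels += [f"{prefix}-{name}" for prefix in ("B", "I", "E", "S")]
    return labels
```

So a 20-category taxonomy yields an 81-way token classification head (1 + 20 × 4), which is worth keeping in mind when inspecting the remapped output layer.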
Compute requirements
- Model: 2.8 GB safetensors (BF16), 1.5B total params / 50M active (MoE).
- Full fine-tune: ~24 GB VRAM (mixed precision) — single A100 40GB or RTX 4090 24GB.
- LoRA alternative: ~12 GB, viable on consumer GPUs.
PoC plan (one-week target)
- Environment + baseline — clone repo, pull checkpoint (~17 GB), verify `opf redact` on sample text, download ai4privacy English splits, convert to `opf` eval JSONL.
- Baseline eval — collapse ai4privacy's 27 classes → Privacy Filter's 8 categories, run `opf eval`, reproduce ~96% F1 as a sanity check, note the per-category breakdown.
- Taxonomy design — design ~15–20 target categories aligned to Snowflake's `SEMANTIC_CATEGORY` (NAME, EMAIL, PAYMENT_CARD, PASSPORT, NATIONAL_IDENTIFIER, STREET_ADDRESS, PHONE_NUMBER, IP_ADDRESS, DATE_OF_BIRTH, AGE, GENDER, OCCUPATION, SALARY, MEDICAL_CONDITION, MEDICATION…). Write `label_space.json`, maintain a mapping file.
- Data prep + smoke training — generate JSONL with the new labels, 80/10/10 split, run `opf train` on a 5–10k subset to validate the pipeline end-to-end.
- Full fine-tune — ~150k examples, 2–3 epochs, single A100, checkpoint every N steps.
- Evaluation — `opf eval` on the held-out test set, build a confusion matrix. Focus on overlapping pairs: NAME vs ORGANIZATION_IDENTIFIER, NATIONAL_IDENTIFIER vs TAX_IDENTIFIER, PAYMENT_CARD vs BANK_ACCOUNT.
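The label-collapse step in the baseline eval is mechanical but easy to get subtly wrong; a small mapping-driven helper keeps it auditable. Both the ai4privacy label names and the base model's 8-category names below are illustrative assumptions — the real lists come from the dataset card and the repo.

```python
# Illustrative partial mapping from fine-grained ai4privacy labels to
# assumed coarse base-model categories (not the real full 27-class table).
AI4PRIVACY_TO_BASE = {
    "firstname": "NAME",
    "lastname": "NAME",
    "email": "EMAIL",
    "phonenumber": "PHONE",
    "creditcardnumber": "ID",
    "iban": "ID",
}

def collapse(labels, mapping, default="O"):
    """Collapse fine-grained gold labels for the baseline sanity check.

    Anything outside the mapping falls back to the background class so
    unmapped classes can't silently inflate a category's support.
    """
    return [mapping.get(label, default) for label in labels]
```

Defaulting unmapped classes to `O` is a deliberate choice: it makes coverage gaps in the mapping show up as recall drops rather than phantom categories.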
Key risks
- Category overlap degradation — expect 10–20 pt F1 drop on overlapping numeric-ID categories.
- Label skew — ai4privacy over-represents `firstname`; eval needs stratified sampling to avoid a misleading macro-F1.
- OOD gap — ai4privacy covers education/health/psychology/finance; Snowflake customer data (log lines, transaction records, support tickets) may differ.
- Licensing — ai4privacy is academic-friendly; production use at Snowflake needs commercial license or synthetic dataset alternative.
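The label-skew mitigation above amounts to capping each category's contribution to the eval set. A minimal sketch, assuming examples are available as (category, payload) pairs — a simplified stand-in for the real annotated documents:

```python
import random
from collections import defaultdict

def stratified_sample(examples, per_category, seed=0):
    """Cap each category at per_category examples so an over-represented
    class (e.g. firstname in ai4privacy) can't dominate the eval set.

    Categories with fewer examples than the cap are kept in full.
    Deterministic given the seed, so the eval split is reproducible.
    """
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for cat, payload in examples:
        by_cat[cat].append((cat, payload))
    sample = []
    for cat in sorted(by_cat):
        items = by_cat[cat]
        rng.shuffle(items)
        sample.extend(items[:per_category])
    return sample
```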