AI Dataset Registry

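Training datasets collected from Hugging Face, Kaggle, and other sources, together with their license, PII, and contamination signals. They are queried the same way, through the same API, as models, MCP servers, and packages.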
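Each entry in this listing carries an identifier of the form `pkg:data/<owner>/<name>`, and the Hugging Face entries also link to a dataset page under the same owner and name. The following is a minimal sketch of how such an identifier could be split and mapped to that page. The function and class names are hypothetical, the URL mapping is inferred only from the entries shown here, and it would apply only to entries whose source is `huggingface`. It does not call the registry API.

```python
from typing import NamedTuple


class DatasetRef(NamedTuple):
    """Owner and name parsed from a 'pkg:data/<owner>/<name>' identifier."""
    owner: str
    name: str


def parse_dataset_id(identifier: str) -> DatasetRef:
    """Split an identifier such as 'pkg:data/cicero-im/modified' into its parts.

    Hypothetical helper: the identifier layout is assumed from the listing.
    """
    prefix = "pkg:data/"
    if not identifier.startswith(prefix):
        raise ValueError(f"not a pkg:data identifier: {identifier!r}")
    owner, sep, name = identifier[len(prefix):].partition("/")
    # Require exactly one owner segment and one name segment.
    if not sep or not owner or not name or "/" in name:
        raise ValueError(f"expected pkg:data/<owner>/<name>: {identifier!r}")
    return DatasetRef(owner, name)


def huggingface_url(ref: DatasetRef) -> str:
    """Build the dataset page URL (assumes the entry's source is huggingface)."""
    return f"https://huggingface.co/datasets/{ref.owner}/{ref.name}"


if __name__ == "__main__":
    ref = parse_dataset_id("pkg:data/cicero-im/piiptbrchatml")
    print(ref.owner, ref.name)
    print(huggingface_url(ref))
```

Keeping the parse step separate from the URL step means the same `DatasetRef` could be reused for entries from other sources, which would need their own URL builder.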

0.60
TypicaAI/pii-masking-60k_fr
pkg:data/TypicaAI/pii-masking-60k_fr

PII French dataset This PII French dataset is based on the world's largest open-source privacy dataset, ai4privacy/pii-masking-200k. The original dataset was filtered with a BERT-based language classifier to keep only the French rows. This dataset was created solely for educational purposes. For more information, please refer to the dataset ai4privacy/pii-masking-200k.

huggingface token-classification 34 downloads
0.60
cicero-im/piiptbrchatml
pkg:data/cicero-im/piiptbrchatml

Dataset Card for PII PT-BR ChatML The piiptbrchatml dataset is designed for training and evaluating models for Personal Identifiable Information (PII) masking in Brazilian Portuguese. It contains conversations where a system is instructed to mask PII from user inputs. The dataset includes the original text, the masked text, and the identified PII entities. O dataset piiptbrchatml foi criado para treinar e avaliar modelos para mascaramento de Informações Pessoais Identificáveis… See the full description on the dataset page: https://huggingface.co/datasets/cicero-im/piiptbrchatml.

huggingface 26 downloads
0.60
cicero-im/modified
pkg:data/cicero-im/modified

Dataset Card for Modified-Anonymization-Dataset This dataset contains anonymization examples in Portuguese. It consists of text samples where Personally Identifiable Information (PII) has been masked. The dataset includes the original text, the masked text, the identified PII entities, and information about potential data pollution introduced during the anonymization process. Este dataset contém exemplos de anonimização em português. Consiste em amostras de… See the full description on the dataset page: https://huggingface.co/datasets/cicero-im/modified.

huggingface 16 downloads
0.60
hyunjunian/PIITEST
pkg:data/hyunjunian/PIITEST

huggingface text-classification 11 downloads
0.55
lianghsun/tw-PII-chat
pkg:data/lianghsun/tw-PII-chat

Taiwan PII Chat (tw-PII-chat) v3 release (2026-05) — 2.4× larger than v2, fixes v2's distribution-shift regression on tw-PII-bench mid/long splits. Supersedes both v1 (61K synthetic short-form) and v2 (76K mixed) releases. Property Value Languages Traditional Chinese (zh-TW), English Items 183,588 Format Chat-format JSON (messages field) + raw NER spans (text + spans) Labels 8 in-schema (matching openai/privacy-filter) + 11 Taiwan-specific OOD Generation Mixed:… See the full description on the dataset page: https://huggingface.co/datasets/lianghsun/tw-PII-chat.

huggingface token-classification apache-2.0 41 downloads
0.55
Vrandan/pii-harmonized-corpus-v2
pkg:data/Vrandan/pii-harmonized-corpus-v2

pii-harmonized-corpus-v2 Harmonized + synthetic-augmented English-only PII NER training corpus, derived from three public datasets and Kimi K2.6 synthetic generation. Stats at a glance Train rows: 204,546 Test rows: 90,160 Total spans (train): 865,473 Total spans (test): 528,449 Real rows in train: 180,892 Synthetic rows in train: 23,654 (11.6%) Languages: English only (language == "en" for all rows) ML labels: 46 entity types Tagging: BILOU at training time (1 + 4×46… See the full description on the dataset page: https://huggingface.co/datasets/Vrandan/pii-harmonized-corpus-v2.

huggingface token-classification other 38 downloads
0.55
ai4privacy/pii-masking-health-phi-200k
pkg:data/ai4privacy/pii-masking-health-phi-200k

👉 Looking for the newest release? The current flagship is ai4privacy/pii-masking-openpii-1m. 1.4M samples, 23 languages, 19 PII classes. 🇪🇺 Personal Health & Medical Information — European PII Dataset Part of PII-Masking-2M (2,717,080 entries) by AI4Privacy Entries PII Annotations Labels Languages Regions 252,437 1,686,246 49 23 29 PII Label Distribution European Coverage Languages: English (12%) · French (9%) · German (9%) ·… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-health-phi-200k.

huggingface token-classification other 38 downloads
0.55
ai4privacy/pfi-masking-100k-full
pkg:data/ai4privacy/pfi-masking-100k-full

👉 Looking for the newest release? The current flagship is ai4privacy/pii-masking-openpii-1m. 1.4M samples, 23 languages, 19 PII classes. EPII Personal Financial Information (PFI) Masking Dataset — Full Overview The EPII PFI Masking Dataset is a large-scale, multilingual dataset of 160,403 annotated text samples containing synthetic Personal Financial Information. Each entry includes source text with embedded PII, a masked version, character-level privacy annotations… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pfi-masking-100k-full.

huggingface token-classification other 27 downloads
0.55
ai4privacy/pli-masking-100k-full
pkg:data/ai4privacy/pli-masking-100k-full

👉 Looking for the newest release? The current flagship is ai4privacy/pii-masking-openpii-1m. 1.4M samples, 23 languages, 19 PII classes. EPII Personal Location Information (PLI) Masking Dataset — Full Overview The EPII PLI Masking Dataset is a large-scale, multilingual dataset of 91,314 annotated text samples containing synthetic Personal Location Information. Each entry includes source text with embedded PII, a masked version, character-level privacy annotations, and… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pli-masking-100k-full.

huggingface token-classification other 25 downloads
0.55
ai4privacy/pdi-masking-100k-full
pkg:data/ai4privacy/pdi-masking-100k-full

👉 Looking for the newest release? The current flagship is ai4privacy/pii-masking-openpii-1m. 1.4M samples, 23 languages, 19 PII classes. EPII Personal Digital Information (PDI) Masking Dataset — Full Overview The EPII PDI Masking Dataset is a large-scale, multilingual dataset of 91,400 annotated text samples containing synthetic Personal Digital Information. Each entry includes source text with embedded PII, a masked version, character-level privacy annotations, and… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pdi-masking-100k-full.

huggingface token-classification other 23 downloads
0.55
ai4privacy/pii-masking-work-pwi-200k
pkg:data/ai4privacy/pii-masking-work-pwi-200k

👉 Looking for the newest release? The current flagship is ai4privacy/pii-masking-openpii-1m. 1.4M samples, 23 languages, 19 PII classes. 🇪🇺 Personal Work & HR Information — European PII Dataset Part of PII-Masking-2M (2,717,080 entries) by AI4Privacy Entries PII Annotations Labels Languages Regions 252,273 1,383,008 41 23 29 PII Label Distribution European Coverage Languages: English (12%) · French (9%) · German (9%) · Spanish (6%)… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-work-pwi-200k.

huggingface token-classification other 23 downloads
0.55
ai4privacy/phi-masking-100k-full
pkg:data/ai4privacy/phi-masking-100k-full

👉 Looking for the newest release? The current flagship is ai4privacy/pii-masking-openpii-1m. 1.4M samples, 23 languages, 19 PII classes. EPII Personal Health Information (PHI) Masking Dataset — Full Overview The EPII PHI Masking Dataset is a large-scale, multilingual dataset of 91,339 annotated text samples containing synthetic Personal Health Information. Each entry includes source text with embedded PII, a masked version, character-level privacy annotations, and… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/phi-masking-100k-full.

huggingface token-classification other 23 downloads
0.55
ai4privacy/pwi-masking-100k-full
pkg:data/ai4privacy/pwi-masking-100k-full

👉 Looking for the newest release? The current flagship is ai4privacy/pii-masking-openpii-1m. 1.4M samples, 23 languages, 19 PII classes. EPII Personal Work Information (PWI) Masking Dataset — Full Overview The EPII PWI Masking Dataset is a large-scale, multilingual dataset of 91,559 annotated text samples containing synthetic Personal Work Information. Each entry includes source text with embedded PII, a masked version, character-level privacy annotations, and mBERT… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pwi-masking-100k-full.

huggingface token-classification other 22 downloads
0.55
ai4privacy/pii-masking-location-pli-200k
pkg:data/ai4privacy/pii-masking-location-pli-200k

👉 Looking for the newest release? The current flagship is ai4privacy/pii-masking-openpii-1m. 1.4M samples, 23 languages, 19 PII classes. 🇪🇺 Personal Location & Travel Information — European PII Dataset Part of PII-Masking-2M (2,717,080 entries) by AI4Privacy Entries PII Annotations Labels Languages Regions 256,762 2,050,300 54 23 29 PII Label Distribution European Coverage Languages: English (12%) · French (9%) · German (9%) ·… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-location-pli-200k.

huggingface token-classification other 22 downloads
0.55
ai4privacy/pii-masking-digital-pdi-200k
pkg:data/ai4privacy/pii-masking-digital-pdi-200k

👉 Looking for the newest release? The current flagship is ai4privacy/pii-masking-openpii-1m. 1.4M samples, 23 languages, 19 PII classes. 🇪🇺 Personal Digital Information — European PII Dataset Part of PII-Masking-2M (2,717,080 entries) by AI4Privacy Entries PII Annotations Labels Languages Regions 198,319 815,110 33 23 29 PII Label Distribution European Coverage Languages: English (12%) · French (10%) · German (9%) · Spanish (6%) ·… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-digital-pdi-200k.

huggingface token-classification other 22 downloads
0.55
ai4privacy/pii-masking-financial-pfi-200k
pkg:data/ai4privacy/pii-masking-financial-pfi-200k

👉 Looking for the newest release? The current flagship is ai4privacy/pii-masking-openpii-1m. 1.4M samples, 23 languages, 19 PII classes. 🇪🇺 Personal Financial Information — European PII Dataset Part of PII-Masking-2M (2,717,080 entries) by AI4Privacy Entries PII Annotations Labels Languages Regions 257,434 1,563,807 48 23 29 PII Label Distribution European Coverage Languages: English (13%) · French (9%) · German (9%) · Spanish (6%)… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-financial-pfi-200k.

huggingface token-classification other 22 downloads
0.50
SII-lyXiang/SynthLeak
pkg:data/SII-lyXiang/SynthLeak

SynthLeak SynthLeak is a dialogue dataset for studying contextual privacy leakage (CPL) in user-generated text. Each item is a multi-turn or single-post forum-style dialogue with structured annotations linking privacy clues in the dialogue to personal information items (PII). This release follows the dataset setting described in: Hangyu Ye, Liyao Xiang, Naixuan Huang, Dongyue Yu, Lijun Zhang, and Gang Wang. PrivSniffer: Graph-based Contextual Privacy Leakage Detection for… See the full description on the dataset page: https://huggingface.co/datasets/SII-lyXiang/SynthLeak.

huggingface token-classification 672 downloads
0.50
bbeglerov/russian-pi-66k-opf
pkg:data/bbeglerov/russian-pi-66k-opf

Russian PII 66K OPF Format This dataset is a converted, OPF-compatible version of wolframko/russian-pii-66k. It is intended for fine-tuning OpenAI Privacy Filter with a hybrid label space: the standard OPF v2 labels plus Russian PII categories that do not have direct standard OPF equivalents. See label_space.json. Files data/train.jsonl: 59087 records data/validation.jsonl: 6565 records label_space.json: OPF custom label space The split was created from source split… See the full description on the dataset page: https://huggingface.co/datasets/bbeglerov/russian-pi-66k-opf.

huggingface 130 downloads
0.50
TheoDB/french-pii-eval
pkg:data/TheoDB/french-pii-eval

French PII Evaluation Dataset A curated French PII detection evaluation and training dataset, built for benchmarking TheoDB/privacy-filter-fr. Dataset Structure Split Examples Purpose test.jsonl 2,500 Held-out evaluation — never used in training test_english.jsonl 426 English regression check train.jsonl 57,248 Training data val.jsonl 500 Validation data Label Taxonomy 8 PII classes (same as openai/privacy-filter): Class Test… See the full description on the dataset page: https://huggingface.co/datasets/TheoDB/french-pii-eval.

huggingface token-classification 117 downloads
0.50
cicero-im/analysis_results
pkg:data/cicero-im/analysis_results

Dataset Card for Analysis Results This dataset contains synthetic text samples generated to evaluate the quality of PII masking. The samples are rated on factors like incorporation, structure, consistency and richness. The dataset can be used to train and evaluate models for PII detection and masking, and to analyze the trade-offs between data utility and privacy. Dataset Structure The dataset consists of generated text samples containing PII (Personally Identifiable… See the full description on the dataset page: https://huggingface.co/datasets/cicero-im/analysis_results.

huggingface 104 downloads
0.50
disi-unibo-nlp/physionet-deid-i2b2-2014
pkg:data/disi-unibo-nlp/physionet-deid-i2b2-2014

The De-identification dataset contains medical text records with Named Entity Recognition (NER) annotations. The dataset is processed to split records into individual sentences while preserving entity annotations. Each sentence is tokenized and annotated in IOB format for training NER models.

huggingface token-classification 75 downloads
0.50
ai4privacy/pfi-masking-100k
pkg:data/ai4privacy/pfi-masking-100k

👉 Looking for the newest release? The current flagship is ai4privacy/pii-masking-openpii-1m. 1.4M samples, 23 languages, 19 PII classes. EPII Personal Financial Information (PFI) Masking Preview Dataset Overview This dataset provides a preview (400 samples) of the EPII Personal Financial Information (PFI) Masking Dataset, a specialized collection designed for identifying and masking sensitive personal financial information within text data. This preview demonstrates… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pfi-masking-100k.

huggingface token-classification other 49 downloads
0.50
auren-research/pii-shield
pkg:data/auren-research/pii-shield

PII Shield: Multilingual PII Detection Dataset PII Shield is a large-scale, multilingual dataset for training and evaluating Personally Identifiable Information (PII) detection models. Built by Auren Research, it combines real-world documents from diverse domains with high-quality span-level PII annotations produced by fastino/gliner2-privacy-filter-PII-multi— achieving the highest F1 on the SPY benchmark among open-source PII detectors. The dataset is designed to… See the full description on the dataset page: https://huggingface.co/datasets/auren-research/pii-shield.

huggingface token-classification cc-by-4.0 49 downloads
0.50
joneauxedgar/pasteproof-pii-dataset-v3
pkg:data/joneauxedgar/pasteproof-pii-dataset-v3

PasteProof PII Dataset v3 Synthetic PII detection dataset with intentional confusion to prevent overfitting. What's Different in v3 Problem v2 v3 Fix Key names always match content apiKey → API_KEY 30% use generic names like data, x, field1 No lookalikes - Mixed real PII with fake lookalikes Always structured JSON/SQL/etc 20% raw PII without context Easy negatives Generic code Hard negatives (test cards, example.com emails) Generation… See the full description on the dataset page: https://huggingface.co/datasets/joneauxedgar/pasteproof-pii-dataset-v3.

huggingface token-classification mit 47 downloads
0.50
shivaniachary123/pii-masking-health-phi-preview
pkg:data/shivaniachary123/pii-masking-health-phi-preview

👉 Looking for the newest release? The current flagship is ai4privacy/pii-masking-openpii-1m. 1.4M samples, 23 languages, 19 PII classes. PII Masking Personal Health & Medical Information (PHI) — Preview 50 sample entries from the PII-Masking-2M European release by AI4Privacy. Source text and PII values are redacted in this preview. Contact us for full access. Label Distribution Language Distribution European Coverage Full Dataset… See the full description on the dataset page: https://huggingface.co/datasets/shivaniachary123/pii-masking-health-phi-preview.

huggingface token-classification cc-by-4.0 46 downloads
0.50
LocalDoc/pii_ner_azerbaijani
pkg:data/LocalDoc/pii_ner_azerbaijani

PII NER Azerbaijani Dataset Short, synthetic Azerbaijani dataset for PII-aware Named Entity Recognition (token classification). Useful for training and evaluating models that detect and localize personally identifiable information (PII) in Azerbaijani text. Note: All examples are synthetically generated with the library az-data-generator https://github.com/LocalDoc-Azerbaijan/az-data-generator. No real persons or contact details are included. Dataset Summary Each row… See the full description on the dataset page: https://huggingface.co/datasets/LocalDoc/pii_ner_azerbaijani.

huggingface token-classification cc-by-4.0 46 downloads
0.50
JALAPENO11/model-inversion-adversarial
pkg:data/JALAPENO11/model-inversion-adversarial

Model Inversion Adversarial Dataset 39,950 (original, anonymized) sentence pairs (target: 40,000) for black-box model inversion attack research against PII anonymization models. Each record contains the original PII-rich sentence and the BART-anonymized output produced by a fine-tuned BART-base anonymizer, along with rich metadata. Splits Split Count train 38,032 eval 1,918 total 39,950 Probing Strategies Strategy Count Purpose S1… See the full description on the dataset page: https://huggingface.co/datasets/JALAPENO11/model-inversion-adversarial.

huggingface text-generation mit 46 downloads
0.50
ai4privacy/phi-masking-100k
pkg:data/ai4privacy/phi-masking-100k

👉 Looking for the newest release? The current flagship is ai4privacy/pii-masking-openpii-1m. 1.4M samples, 23 languages, 19 PII classes. EPII Personal Health Information (PHI) Masking Preview Dataset Overview This dataset provides a preview (400 samples) of the EPII Personal Health Information (PHI) Masking Dataset, a specialized collection designed for identifying and masking sensitive personal health information within text data. This preview demonstrates the data… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/phi-masking-100k.

huggingface token-classification other 45 downloads
0.50
micmadAAU/Nemotron-PII
pkg:data/micmadAAU/Nemotron-PII

Nemotron-PII: Synthesized Data for Privacy-Preserving AI Dataset Description Nemotron‑PII is a synthetic, persona‑grounded dataset for training and evaluating detection of Personally Identifiable Information (PII) and Protected Health Information (PHI) in text at production quality. It contains 100,000 English records across 50+ industries with span‑level annotations for 55+ PII/PHI categories, generated with NVIDIA NeMo Data Designer using synthetic personas grounded in… See the full description on the dataset page: https://huggingface.co/datasets/micmadAAU/Nemotron-PII.

huggingface token-classification cc-by-4.0 45 downloads
0.50
BoB14TeamSentinel/sentinel-kr-sensitive-entities-synthetic-v3
pkg:data/BoB14TeamSentinel/sentinel-kr-sensitive-entities-synthetic-v3

Sentinel KR Sensitive Entities (Synthetic) v3 Overview Sentinel KR Sensitive Entities (Synthetic) v3 is a Korean synthetic (AI-generated) dataset for whitelist-only sensitive-entity detection in DLP / LLM guardrail scenarios. All sensitive values in this dataset (e.g., phone numbers, emails, IDs, tokens, keys) are artificially generated by AI and do not come from real individuals, real incidents, or collected private datasets. Any resemblance to real persons or real… See the full description on the dataset page: https://huggingface.co/datasets/BoB14TeamSentinel/sentinel-kr-sensitive-entities-synthetic-v3.

huggingface token-classification cc-by-4.0 44 downloads
0.50
ai4privacy/pdi-masking-100k
pkg:data/ai4privacy/pdi-masking-100k

👉 Looking for the newest release? The current flagship is ai4privacy/pii-masking-openpii-1m. 1.4M samples, 23 languages, 19 PII classes. EPII Personal Digital Information (PDI) Masking Preview Dataset Overview This dataset provides a preview (400 samples) of the EPII Personal Digital Information (PDI) Masking Dataset, a specialized collection designed for identifying and masking sensitive personal digital information within text data. This preview demonstrates the data… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pdi-masking-100k.

huggingface token-classification other 43 downloads
0.50
aniket-curlscape/pii-masking-english-5k
pkg:data/aniket-curlscape/pii-masking-english-5k

Important This repository contains the English-only subset of the Ai4Privacy PII-Masking-300k Dataset. The dataset is curated to provide English texts only, while retaining the structure, labeling schema, and licensing of the original dataset. Licensing Academic use is encouraged with proper citation provided it follows similar license terms*. Commercial entities should contact us at licensing@ai4privacy.com for licensing inquiries and additional data access.* Terms… See the full description on the dataset page: https://huggingface.co/datasets/aniket-curlscape/pii-masking-english-5k.

huggingface text-classification other 42 downloads
0.50
agentlans/personal-information-prompts
pkg:data/agentlans/personal-information-prompts

Personal Information Prompts This dataset contains multilingual prompts derived from the all_sample subset of the agentlans/allenai-WildChat-4.8M dataset. Each prompt features artificially inserted personally identifiable information (PII) generated randomly with the Faker Python package for various locales. Each rewritten prompt uses the google/gemma-3-12b-it model to incorporate the synthetic personal data. Dataset fields for the two configurations: classification… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/personal-information-prompts.

huggingface text-classification cc-by-4.0 41 downloads
0.50
huggingbahl21/saha-al
pkg:data/huggingbahl21/saha-al

SAHA-AL: PII Anonymization Benchmark SAHA-AL is a benchmark for training and evaluating text anonymization systems. It goes beyond detection accuracy by evaluating anonymization as a system under attack — measuring adversarial re-identification risk, contextual privacy leakage, and a formalized privacy-utility tradeoff. Key Features 3 evaluation tasks: PII detection, text anonymization quality, and adversarial privacy risk 11 metrics spanning leakage, utility, format… See the full description on the dataset page: https://huggingface.co/datasets/huggingbahl21/saha-al.

huggingface text-generation mit 40 downloads
0.50
mukuls9971/indian-address-v1
pkg:data/mukuls9971/indian-address-v1

Indian Address Synthetic Dataset v1 Synthetic multilingual Indian-address token-classification dataset generated by the pii-model-oss project. Repository Dataset repo: mukuls9971/indian-address-v1 Train split: 12000 Validation split: 1000 Test split: 1000 Files train.jsonl validation.jsonl test.jsonl report.json Notes Generated and published by the pii-model-oss workflow. Upstream datasets used to assemble benchmark variants retain their own… See the full description on the dataset page: https://huggingface.co/datasets/mukuls9971/indian-address-v1.

huggingface token-classification mit 39 downloads
0.50
UniDataPro/synthetic-turkish-passports
pkg:data/UniDataPro/synthetic-turkish-passports

Turkish passport dataset - 5,000 images Dataset comprises 5,000 meticulously organized files capturing Turkish passports under highly controlled variations, making it an invaluable resource for developing robust document recognition and verification systems. It is specifically designed for training and testing models in passport authentication, biometric data extraction, and identity verification. By leveraging this dataset containing detailed information from Turkish passports… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/synthetic-turkish-passports.

huggingface image-to-text cc-by-nc-nd-4.0 36 downloads
0.50
shivaniachary123/sovereign-pii-detection-v1
pkg:data/shivaniachary123/sovereign-pii-detection-v1

🛡️ Sovereign PII Detection Dataset (v1.0) Maintainer: Cata Risk Lab | Project: Wattle Guard 🌍 Dataset Summary This synthetic dataset contains labeled examples of Sovereign Identity Markers specific to the Swiss, UK, and Australian jurisdictions. It is designed to train and benchmark the Wattle Guard redaction engine, ensuring compliance with cross-border data protection laws (nFADP, UK GDPR, Privacy Act 1988). Unlike generic PII datasets that focus on US data (SSN)… See the full description on the dataset page: https://huggingface.co/datasets/shivaniachary123/sovereign-pii-detection-v1.

huggingface token-classification mit 36 downloads
0.50
gravitee-io/pii-detection-dataset
pkg:data/gravitee-io/pii-detection-dataset

Gravitee PII Detection A harmonized, multi-source corpus for fine-tuning encoder-style PII / NER models. 25 canonical PII classes, character-level span annotations, 175,881 English examples, 781,052 entity spans. Published as a single split (train). Hold-out evaluation is expected to be performed against unrelated external PII corpora rather than against a slice of this dataset. Quick start from datasets import load_dataset ds =… See the full description on the dataset page: https://huggingface.co/datasets/gravitee-io/pii-detection-dataset.

huggingface token-classification apache-2.0 35 downloads
0.50
shivaniachary123/pii-detection-corpus
pkg:data/shivaniachary123/pii-detection-corpus

PII Detection Corpus Synthetic dataset of text samples containing labeled PII (Personally Identifiable Information) for testing and benchmarking PII detection/scrubbing tools. Fields text: Text sample containing PII pii_type: Category of PII (email, phone, ssn, credit_card, ip, dob, address, passport, api_key, name, iban) pii_value: The exact PII string in the text start: Character offset start end: Character offset end context: Surrounding context category (medical… See the full description on the dataset page: https://huggingface.co/datasets/shivaniachary123/pii-detection-corpus.

huggingface token-classification mit 35 downloads
0.50
tugrulkaya/turkish-pii-dataset
pkg:data/tugrulkaya/turkish-pii-dataset

🔒 Turkish PII Detection Dataset A hand-labeled, research-oriented NER dataset for detecting Personally Identifiable Information (PII) in Turkish text. It is designed as a foundational resource for KVKK- and GDPR-compliant AI development. Dataset Summary Number of examples: ~20 labeled texts (small-scale, starter level) Categories: 7+ PII types Language: Turkish Format: JSON / token-level labeling Purpose: Education, research, anonymization prototypes… See the full description on the dataset page: https://huggingface.co/datasets/tugrulkaya/turkish-pii-dataset.

huggingface token-classification cc-by-4.0 35 downloads
0.50
575-lab/kiji-inspector-reviewed-pairs
pkg:data/575-lab/kiji-inspector-reviewed-pairs

Kiji PII Detection Training Data Synthetic multilingual dataset for training PII (Personally Identifiable Information) detection models with token-level entity annotations and coreference resolution. Dataset Summary Samples 99,990 (train: 89,991, test: 9,999) Languages 6 (Dutch, Spanish, German, English, Danish, French) Countries 20 PII entity types 26 Total entity annotations 814,306 (avg 8.1 per sample) Coreference clusters 142,142 (99% of… See the full description on the dataset page: https://huggingface.co/datasets/575-lab/kiji-inspector-reviewed-pairs.

huggingface token-classification apache-2.0 35 downloads
0.50
vstantch/x402-pii-corpus
pkg:data/vstantch/x402-pii-corpus

x402 PII Metadata Corpus Synthetic labelled corpus of 2,000 x402 payment metadata triples for PII filter evaluation. Released alongside the paper "Hardening x402: Privacy-Preserving Agentic Payments via Pre-Execution Metadata Filtering". Paper: arXiv:2604.11430 [cs.CR] Canonical archive: IEEE DataPort doi:10.21227/kpsz-nq73 Code: presidio-v/presidio-hardened-x402 Dataset description Each record represents one x402 payment metadata triple (resource_url, description… See the full description on the dataset page: https://huggingface.co/datasets/vstantch/x402-pii-corpus.

huggingface token-classification mit 34 downloads
0.50
ai4privacy/pwi-masking-100k
pkg:data/ai4privacy/pwi-masking-100k

👉 Looking for the newest release? The current flagship is ai4privacy/pii-masking-openpii-1m. 1.4M samples, 23 languages, 19 PII classes. EPII Personal Work Information (PWI) Masking Preview Dataset Overview This dataset provides a preview (400 samples) of the EPII Personal Work Information (PWI) Masking Dataset, a specialized collection designed for identifying and masking sensitive personal work information within text data. This preview demonstrates the data… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pwi-masking-100k.

huggingface token-classification other 33 downloads
0.50
ShmalexFlow/whiteout-compliance-benchmark
pkg:data/ShmalexFlow/whiteout-compliance-benchmark

Whiteout AI Compliance Benchmark A 15,915-prompt benchmark for evaluating AI compliance engines — systems that enforce content policies on user prompts before they reach AI providers. Built by Groovy Security for the Whiteout AI platform. Dataset Summary Property Value Total prompts 15,915 Categories 9 (PHI, PII, GDPR, Legal, Code, Confidential, Security, Finance, Education) Policies 74 across all categories Prompt types 3 (safe, violation, edge_case)… See the full description on the dataset page: https://huggingface.co/datasets/ShmalexFlow/whiteout-compliance-benchmark.

huggingface text-classification apache-2.0 29 downloads
0.50
orgrctera/pii_masking_300k_information_extraction
pkg:data/orgrctera/pii_masking_300k_information_extraction

PII Masking 300k — Information Extraction Dataset summary This repository hosts a validation sample of the PII Masking 300k benchmark for the information extraction track: models must identify personally identifiable information (PII) in text and produce structured extractions (slot-filling JSON), optional token-level BIO labels, and span-based annotations for masking or redaction workflows. The full PII Masking 300k suite is designed to stress-test privacy-preserving… See the full description on the dataset page: https://huggingface.co/datasets/orgrctera/pii_masking_300k_information_extraction.

huggingface token-classification apache-2.0 29 downloads
0.50
ahczhg/Nemotron-PII
pkg:data/ahczhg/Nemotron-PII

Nemotron-PII: Synthesized Data for Privacy-Preserving AI Dataset Description Nemotron‑PII is a synthetic, persona‑grounded dataset for training and evaluating detection of Personally Identifiable Information (PII) and Protected Health Information (PHI) in text at production quality. It contains 100,000 English records across 50+ industries with span‑level annotations for 55+ PII/PHI categories, generated with NVIDIA NeMo Data Designer using synthetic personas grounded in… See the full description on the dataset page: https://huggingface.co/datasets/ahczhg/Nemotron-PII.

huggingface token-classification cc-by-4.0 26 downloads
0.50
NAMANDREWLV/pii-masking-95k-preencoded
pkg:data/NAMANDREWLV/pii-masking-95k-preencoded

VI PII Masking (Pre-encoded) – 95k Private Vietnamese dataset for PII detection and masking, designed for token classification / NER in privacy-preserving NLP systems. Dataset Overview Language: Vietnamese (vi) Domain: Privacy / PII masking Total samples: ~95,000 Splits: Train: 76,097 Validation: 9,512 Test: 9,513 Format: JSONL Data Fields Raw & masked text source_text masked_text privacy_mask language region script split uid… See the full description on the dataset page: https://huggingface.co/datasets/NAMANDREWLV/pii-masking-95k-preencoded.

huggingface token-classification other 25 downloads
0.50
Ari-S-123/better-english-pii-anonymizer
pkg:data/Ari-S-123/better-english-pii-anonymizer

PII Detection Combined Dataset Combined dataset for PII (Personally Identifiable Information) detection, merging the ai4privacy English-only subset with synthetically generated challenging examples targeting NER failure modes. Dataset Description This dataset combines two sources: ai4privacy/open-pii-masking-500k (English subset): 120,533 train / 30,160 test examples Synthetic data (Grok-4.1-Non-reasoning generated/GPT-5.1 validated): 4,801 train / 1,201 test examples… See the full description on the dataset page: https://huggingface.co/datasets/Ari-S-123/better-english-pii-anonymizer.

huggingface token-classification mit 22 downloads
0.50
UniDataPro/synthetic-printed-brazilian-passports
pkg:data/UniDataPro/synthetic-printed-brazilian-passports

Brazilian passport dataset The dataset comprises 5,000 high-resolution synthetic photos of Brazilian passports, designed to advance computer vision and identity verification systems. It provides a secure and ethical resource for training robust models for OCR (Optical Character Recognition), document analysis, and spoofing detection, all without exposing real personal data or sensitive personal information. By utilizing this dataset, researchers and developers can enhance security… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/synthetic-printed-brazilian-passports.

huggingface image-to-text cc-by-nc-nd-4.0 20 downloads
0.50
UniDataPro/synthetic-printed-australian-passports
pkg:data/UniDataPro/synthetic-printed-australian-passports

Australian passport dataset The dataset comprises 5,000 high-resolution synthetic photos of Australian passports, designed to advance computer vision and identity verification systems. It provides a secure and ethical resource for training robust models for OCR (Optical Character Recognition), document analysis, and spoofing detection, all without exposing real personal data or sensitive personal information. This dataset is an essential tool for organizations and government… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/synthetic-printed-australian-passports.

huggingface image-to-text cc-by-nc-nd-4.0 19 downloads
0.50
Cata-Risk-Lab/sovereign-pii-detection-v1
pkg:data/Cata-Risk-Lab/sovereign-pii-detection-v1

🛡️ Sovereign PII Detection Dataset (v1.0) Maintainer: Cata Risk Lab | Project: Wattle Guard 🌍 Dataset Summary This synthetic dataset contains labeled examples of Sovereign Identity Markers specific to the Swiss, UK, and Australian jurisdictions. It is designed to train and benchmark the Wattle Guard redaction engine, ensuring compliance with cross-border data protection laws (nFADP, UK GDPR, Privacy Act 1988). Unlike generic PII datasets that focus on US data (SSN)… See the full description on the dataset page: https://huggingface.co/datasets/Cata-Risk-Lab/sovereign-pii-detection-v1.

huggingface token-classification mit 19 downloads
0.50
joelbarmettler/gheim-ch-pii-212k
pkg:data/joelbarmettler/gheim-ch-pii-212k

gheim-ch-pii-212k Summary. 212,503-chunk multilingual PII NER dataset covering the four official Swiss languages and English. 84% is real text from the Apertus pretrain corpora (Swiss court rulings, federal parliament records, Swiss-filtered web text, Romansh corpus); the remaining 16% is template- and LLM-generated synthetic prose used to populate cells where real-text coverage was insufficient. Annotations are machine generated by three independent open-weights LLMs (Gemma… See the full description on the dataset page: https://huggingface.co/datasets/joelbarmettler/gheim-ch-pii-212k.

huggingface token-classification cc-by-4.0 17 downloads
0.50
darkmatter2222/redact-v1
pkg:data/darkmatter2222/redact-v1

Redact-v1 Synthetic Data Card Overview This dataset consists of 100% synthetic data—every element is artificially generated. No data originates from any genuine or external source. The full repository, including synthetic data generation models and use cases, is available on GitHub: https://github.com/darkmatter2222/NLU-Redact-PII. Categories of Synthetic Sensitive Data The dataset includes the following categories of artificially generated sensitive data:… See the full description on the dataset page: https://huggingface.co/datasets/darkmatter2222/redact-v1.

huggingface apache-2.0 15 downloads
0.50
dbabis/20NG_5topics_PII_annotated
pkg:data/dbabis/20NG_5topics_PII_annotated

20 Newsgroups (5 Topics) - PII-Augmented version Description This dataset is a curated subset of the 20 Newsgroups corpus, containing 5 clearly distinguishable topics for experimentation with intelligent text anonymization and topic classification. It was created as part of the Bachelor’s thesis “Intelligent anonymization for natural language processing and inference” at FIIT STU, 2025. Versions A. 20NG_5topics.jsonl Original subset with 5 selected… See the full description on the dataset page: https://huggingface.co/datasets/dbabis/20NG_5topics_PII_annotated.

huggingface token-classification mit 14 downloads
0.50
EdyVision/pii-skills-ablation
pkg:data/EdyVision/pii-skills-ablation

PII Skills Ablation Benchmark This repository provides the benchmark and experiment configuration for the ablation study described in: "When Parametric Knowledge Wins: A Controlled Ablation of Agent Skills and Tool Use for PII Detection in Small Language Models" The benchmark is a stratified sample of text with ground-truth PII spans aligned to PII-Codex canonical types. It is used to evaluate whether zero-shot prompting, documentation injection (+Docs), tool access (+Tool), or… See the full description on the dataset page: https://huggingface.co/datasets/EdyVision/pii-skills-ablation.

huggingface mit 12 downloads
0.50
paperboy-ai/desktop-pii-210
pkg:data/paperboy-ai/desktop-pii-210

Desktop PII 210 This directory is a local Hugging Face-compatible dataset package for synthetic desktop screenshots with expected privacy and utility QA pairs. The screenshots are generated synthetic desktop scenes. The visible sensitive values are fictional benchmark strings, not real personal data. Contents images/: all 210 generated PNG screenshots. images/metadata.jsonl: Hugging Face imagefolder metadata, one row per image. data/train.parquet: the default Hugging… See the full description on the dataset page: https://huggingface.co/datasets/paperboy-ai/desktop-pii-210.

huggingface image-to-text other 9 downloads
0.50
Shayfra7926/PANOPTICON
pkg:data/Shayfra7926/PANOPTICON

PANOPTICON Dataset Summary PANOPTICON (PII-based Assemblage of Naturalistic Output–Prompt Tuples for Investigating Privacy Leakage in Conversational AI) is a dataset of synthetic, PII-bearing prompts designed to enable controlled evaluation of privacy leakage / prompt inversion behaviors in LLMs. The dataset is organized by high-level Category and Scenario, and includes fields that support separating PII spans from surrounding benign context for analysis.… See the full description on the dataset page: https://huggingface.co/datasets/Shayfra7926/PANOPTICON.

huggingface text-generation other 8 downloads
0.50
EdyVision/pii-skills-ablation-results
pkg:data/EdyVision/pii-skills-ablation-results

PII Skills Ablation — Scored Results This repository contains model predictions and evaluation scores for the ablation study described in: "When Parametric Knowledge Wins: A Controlled Ablation of Agent Skills and Tool Use for PII Detection in Small Language Models" Results are produced by running four open-weight instruction-tuned models (Gemma 2 9B, Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B) under four conditions (zero-shot, +Docs, +Tool, +Skills) on the benchmark in… See the full description on the dataset page: https://huggingface.co/datasets/EdyVision/pii-skills-ablation-results.

huggingface mit 7 downloads
0.50
lucianfialho/privacy-filter-br-dataset
pkg:data/lucianfialho/privacy-filter-br-dataset

privacy-filter-br Dataset Synthetic Portuguese (BR) PII detection dataset used to fine-tune the privacy-filter-br NER model. 22 PII categories, BIOES tagging compatible. Latest: v8.1. main always points to the latest stable version. Currently: v8.1 (172075 train + 17072 holdout). from datasets import load_dataset ds = load_dataset("lucianfialho/privacy-filter-br-dataset") # or pin: load_dataset("lucianfialho/privacy-filter-br-dataset", revision="v8.1") Schema… See the full description on the dataset page: https://huggingface.co/datasets/lucianfialho/privacy-filter-br-dataset.

huggingface token-classification apache-2.0 5 downloads
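
Each entry in this listing pairs a `pkg:data/<owner>/<name>` identifier with a dataset page under `https://huggingface.co/datasets/`. The sketch below shows one way that mapping might be done in Python, assuming the identifier always has exactly that two-part shape. `DatasetRef`, `parse_pkg_id`, and `load_pinned` are hypothetical names used for illustration and are not part of the registry's API. `load_pinned` mirrors the pinned-revision call quoted in the dataset card above and needs the `datasets` package plus network access.

```python
from dataclasses import dataclass

# Both constants are taken from the patterns visible in the listing entries.
HF_DATASETS_BASE = "https://huggingface.co/datasets"
PKG_PREFIX = "pkg:data/"


@dataclass(frozen=True)
class DatasetRef:
    """Owner and name parsed from a pkg:data identifier (hypothetical helper type)."""
    owner: str
    name: str

    @property
    def repo_id(self) -> str:
        return f"{self.owner}/{self.name}"

    @property
    def page_url(self) -> str:
        return f"{HF_DATASETS_BASE}/{self.repo_id}"


def parse_pkg_id(pkg_id: str) -> DatasetRef:
    """Split 'pkg:data/<owner>/<name>' into its parts (assumed format)."""
    if not pkg_id.startswith(PKG_PREFIX):
        raise ValueError(f"not a pkg:data identifier: {pkg_id!r}")
    owner, sep, name = pkg_id[len(PKG_PREFIX):].partition("/")
    if not sep or not owner or not name or "/" in name:
        raise ValueError(f"expected pkg:data/<owner>/<name>, got {pkg_id!r}")
    return DatasetRef(owner, name)


def load_pinned(pkg_id: str, revision: str):
    """Load a dataset at a fixed revision; requires `datasets` and network access."""
    # Imported lazily so that parsing works without the package installed.
    from datasets import load_dataset
    return load_dataset(parse_pkg_id(pkg_id).repo_id, revision=revision)


ref = parse_pkg_id("pkg:data/lucianfialho/privacy-filter-br-dataset")
print(ref.page_url)
```

Pinning a revision, for example `load_pinned("pkg:data/lucianfialho/privacy-filter-br-dataset", "v8.1")`, keeps a training run on a fixed version even if `main` moves on.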
0.50
FreekCoolAI/kvk-pii-checker
pkg:data/FreekCoolAI/kvk-pii-checker

KVK PII-checker dataset Dutch-language Q&A dataset for training a micro-LLM (Gemma-3-1B) as a privacy/GDPR (AVG) checker: given a text that someone would want to paste into an AI tool, the model returns a three-line verdict. Format Each example: { "category": "pii-check/...", "question": "<text being checked>", "answer": "Oordeel: <VEILIG|ANONIMISEER EERST|NIET VERSTUREN>\nGevonden: ...\nAdvies: ..." } Verdict classes VEILIG (safe): no… See the full description on the dataset page: https://huggingface.co/datasets/FreekCoolAI/kvk-pii-checker.

huggingface text-generation cc-by-4.0 0 downloads
0.45
SulthanAbiyyu/anak-baik
pkg:data/SulthanAbiyyu/anak-baik

Anak-Baik Dataset: Overview Anak-Baik dataset is a collection of instruction-output pairs in Bahasa Indonesia, designed for Supervised Fine-Tuning (SFT) tasks. The dataset contains examples of both harmful and harmless outputs, aimed at promoting ethical AI development (hence the name; anak baik == good boy :D). The dataset consists of pairs of instructions and their corresponding outputs, categorized as either harmful or harmless and their topics. This structure enables models to… See the full description on the dataset page: https://huggingface.co/datasets/SulthanAbiyyu/anak-baik.

huggingface text-generation 14 downloads
0.45
SulthanAbiyyu/anak-baik-rejection-classification
pkg:data/SulthanAbiyyu/anak-baik-rejection-classification

Anak-Baik Rejection Classification: Overview The Anak-Baik Rejection Classification dataset is a curated collection of labeled instructional rejections in Bahasa Indonesia, specifically designed for Supervised Fine-Tuning (SFT) tasks. This dataset includes examples of both harmful and harmless instructions, along with labels indicating whether an instruction should be answered or rejected. This dataset is aimed at promoting ethical AI development (hence the name; anak baik == good… See the full description on the dataset page: https://huggingface.co/datasets/SulthanAbiyyu/anak-baik-rejection-classification.

huggingface text-classification 5 downloads
0.40
ai4privacy/pii-masking-300k
pkg:data/ai4privacy/pii-masking-300k

👉 Looking for the newest release? The current flagship is ai4privacy/pii-masking-openpii-1m. 1.4M samples, 23 languages, 19 PII classes. Purpose and Features 🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Key facts: OpenPII-220k text entries have 27 PII classes (types of sensitive data), targeting… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-300k.

huggingface text-classification other 7,644 downloads
0.40
nvidia/Nemotron-PII
pkg:data/nvidia/Nemotron-PII

Nemotron-PII: Synthesized Data for Privacy-Preserving AI Dataset Description Nemotron‑PII is a synthetic, persona‑grounded dataset for training and evaluating detection of Personally Identifiable Information (PII) and Protected Health Information (PHI) in text at production quality. It contains 100,000 English records across 50+ industries with span‑level annotations for 55+ PII/PHI categories, generated with NVIDIA NeMo Data Designer using synthetic personas grounded in… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-PII.

huggingface token-classification cc-by-4.0 4,030 downloads
0.40
ai4privacy/pii-masking-400k
pkg:data/ai4privacy/pii-masking-400k

Purpose and Features 🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. AI4Privacy Dataset Analytics 📊 Dataset Overview Total entries: 406,896… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-400k.

huggingface text-classification other 2,832 downloads
0.40
Meddies/meddies-pii
pkg:data/Meddies/meddies-pii

Meddies PII Synthetic PII extraction data for multilingual clinical and administrative documents, with language-specific, domain-transfer, translation, and instruction-style views in one Hub repo. [!IMPORTANT] This is a synthetic de-identification research artifact for healthcare AI teams. It is not medical advice, not a privacy certification, and not a substitute for task-specific validation on your own data. If you want to use this dataset in commercial work… See the full description on the dataset page: https://huggingface.co/datasets/Meddies/meddies-pii.

huggingface token-classification cc-by-nc-4.0 2,052 downloads
0.40
ai4privacy/pii-masking-openpii-1m
pkg:data/ai4privacy/pii-masking-openpii-1m

OpenPII 1M — Multilingual PII Masking Dataset Overview The OpenPII 1M dataset is a large-scale, multilingual collection of 1,428,143 synthetic text examples with fine-grained PII (Personally Identifiable Information) annotations, spanning 23 European languages and 19 entity types. Built to advance open research in privacy-preserving NLP, this dataset enables the development and benchmarking of Named Entity Recognition (NER) models, token classification pipelines… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-openpii-1m.

huggingface token-classification other 1,769 downloads
0.40
ai4privacy/open-pii-masking-500k-ai4privacy
pkg:data/ai4privacy/open-pii-masking-500k-ai4privacy

🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Dataset Analytics 📊 - ai4privacy/open-pii-masking-500k-ai4privacy p5y Data Analytics… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy.

huggingface text-classification other 1,510 downloads
0.40
guardion/BR-Agentic-PII-Benchmark
pkg:data/guardion/BR-Agentic-PII-Benchmark

BR-Agentic-PII-Benchmark Overview BR-Agentic-PII-Benchmark is a highly annotated dataset of synthetic multi-turn conversations between humans and AI banking assistants in Brazilian Portuguese. It is designed to benchmark the processes of detection, anonymization, de-anonymization, and transparency for AI agents. Key Use Cases This dataset helps evaluate systems that redact and protect Personally Identifiable Information (PII) before it leaves a secure perimeter… See the full description on the dataset page: https://huggingface.co/datasets/guardion/BR-Agentic-PII-Benchmark.

huggingface text-generation mit 1,010 downloads
0.40
gretelai/synthetic_pii_finance_multilingual
pkg:data/gretelai/synthetic_pii_finance_multilingual

💼 📊 Synthetic Financial Domain Documents with PII Labels gretelai/synthetic_pii_finance_multilingual is a dataset of full length synthetic financial documents containing Personally Identifiable Information (PII), generated using Gretel Navigator and released under Apache 2.0. This dataset is designed to assist with the following use cases: 🏷️ Training NER (Named Entity Recognition) models to detect and label PII in… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual.

huggingface text-classification apache-2.0 995 downloads
0.40
nvidia/Privasis-Zero
pkg:data/nvidia/Privasis-Zero

Privasis-Zero Dataset Description: Privasis-Zero is a large-scale synthetic dataset consisting of diverse text records—such as medical and financial records, legal documents, emails, and messages—containing rich, privacy-sensitive information. Each record includes synthetic profile details, surrounding social context, and annotations of privacy-related content. All data are fully generated using LLMs, supplemented with first names sourced from the U.S. Social Security… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Privasis-Zero.

huggingface text-generation other 684 downloads
0.40
anony-mouse123/Instruction_recall_dataset
pkg:data/anony-mouse123/Instruction_recall_dataset

CanaryBench-PII Frequency-aware canary injection benchmark for auditing memorization in finetuned language models, built on the AI4Privacy PII reconstruction task. Dataset Description This dataset is part of CanaryBench, a benchmark for evaluating memorization in finetuned language models across repetition tiers and privacy regimes. Frequency tiers: 1×, 10×, 50× PII types: EMAIL, PHONE Member canaries: 770 Reference canaries: 1000 Tasks: PII detection, secret recall… See the full description on the dataset page: https://huggingface.co/datasets/anony-mouse123/Instruction_recall_dataset.

huggingface text-generation cc-by-4.0 621 downloads
0.40
mks-logic/SPY
pkg:data/mks-logic/SPY

SPY: Enhancing Privacy with Synthetic PII Detection Dataset We proudly present the SPY Dataset, a novel synthetic dataset for the task of Personal Identifiable Information (PII) detection. This dataset highlights the importance of safeguarding PII in modern data processing and serves as a benchmark for advancing privacy-preserving technologies. Key Highlights Innovative Generation: We present a methodology for developing the SPY dataset and compare it to other… See the full description on the dataset page: https://huggingface.co/datasets/mks-logic/SPY.

huggingface token-classification cc-by-4.0 490 downloads
0.40
jmdanto/corpus-essms-public
pkg:data/jmdanto/corpus-essms-public

Social and medico-social corpus (public export) Overview This dataset contains professional writing in French from the social and medico-social sector. The public export released here consists of fictional but realistic reports (real documents have been excluded) and targets NER, pseudonymization, entity extraction, and model evaluation in a professional context. Public export volume: 410 realistic fictional records in data.jsonl (text + metadata)… See the full description on the dataset page: https://huggingface.co/datasets/jmdanto/corpus-essms-public.

huggingface token-classification apache-2.0 434 downloads
0.40
hivetrace/pii-bench
pkg:data/hivetrace/pii-bench

PII-Bench (ru) A benchmark for evaluating the quality of personal data (PII) detection in Russian-language texts. It uses span-level annotation with explicit start and end character indices and entity names, which makes it possible to validate systems such as Presidio, ML models, and regular expressions alike, focusing not on evaluating the models themselves in IO, BIO, or BILOU format but on a comprehensive evaluation of the whole NER pipeline. Data format { id: chat_03, domain: L-CHAT… See the full description on the dataset page: https://huggingface.co/datasets/hivetrace/pii-bench.

huggingface token-classification other 372 downloads
0.40
tomekkorbak/pile-pii-scrubadub
pkg:data/tomekkorbak/pile-pii-scrubadub

Dataset Card for pile-pii-scrubadub Dataset Summary This dataset contains text from The Pile, annotated based on the personally identifiable information (PII) in each sentence. Each document (row in the dataset) is segmented into sentences, and each sentence is given a score: the percentage of words in it that are classified as PII by Scrubadub. Supported Tasks and Leaderboards [More Information Needed] Languages This dataset is taken from The… See the full description on the dataset page: https://huggingface.co/datasets/tomekkorbak/pile-pii-scrubadub.

huggingface text-classification mit 360 downloads
0.40
UniDataPro/synthetic-printed-usa-passports-dataset
pkg:data/UniDataPro/synthetic-printed-usa-passports-dataset

Passport Dataset - 9 600 Images The dataset comprises 9,600 high-quality synthetically generated passport images, providing a robust resource for training and verifying document analysis systems. Every passport is presented across 3 angles (0°, 25°, 45°), 4 lighting conditions (Natural-daylight, Office-LED, Warm-indoor, Dim-light), 4 backgrounds (Neutral wall, Textured desk, Outdoor pavement, Docs-on-docs), and 2 distances (Close, Medium), creating a rich and challenging dataset for… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/synthetic-printed-usa-passports-dataset.

huggingface image-to-text cc-by-nc-nd-4.0 332 downloads
0.40
raayraay/privacyleak-pii
pkg:data/raayraay/privacyleak-pii

PrivacyLeak-PII A Machine Unlearning Benchmark for Personal Information Extraction All data in this dataset is synthetically generated using Faker. No real PII is included. The Problem We're Solving Current unlearning evaluations only check if models refuse direct questions about "forgotten" data. But real attackers don't ask nicely: Direct question (model refuses): "What is John Doe's SSN?" Prefix completion (model leaks): "Customer Name: John Doe Issue:… See the full description on the dataset page: https://huggingface.co/datasets/raayraay/privacyleak-pii.

huggingface text-generation mit 305 downloads
0.40
Pritesh-2711/pii-bench
pkg:data/Pritesh-2711/pii-bench

PIIBench Description PIIBench is a unified benchmark dataset for PII detection across multiple domains. Paper arXiv: http://arxiv.org/abs/2604.15776 Dataset Summary Total records: ~1.39M Entity types: 48 Format: BIO tagging Structure Each example contains: tokens: list of tokens labels: BIO labels source: original data source of the sample Splits train.jsonl validation.jsonl test.jsonl Source Ten datasets are… See the full description on the dataset page: https://huggingface.co/datasets/Pritesh-2711/pii-bench.

huggingface token-classification apache-2.0 292 downloads
0.40
lianghsun/tw-PII-bench
pkg:data/lianghsun/tw-PII-bench

Taiwan PII Benchmark (tw-PII-bench) A token-classification benchmark for evaluating PII detectors on Taiwan-specific personally identifiable information in Traditional Chinese (繁體中文). Designed against openai/privacy-filter to surface its label-coverage gaps and locale-specific failure modes. The benchmark has three splits by text length, so you can isolate where a model breaks (boundary handling, long-context coverage, multi-PII reasoning): Split Items Text length Avg PII… See the full description on the dataset page: https://huggingface.co/datasets/lianghsun/tw-PII-bench.

huggingface token-classification apache-2.0 274 downloads
0.40
Ganasekhar/pii-masking-400k
pkg:data/Ganasekhar/pii-masking-400k

Purpose and Features 🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. AI4Privacy Dataset Analytics 📊 Dataset Overview Total entries: 406,896 Total tokens: 20,564,179 Total PII tokens: 2,357,029 Number of PII classes in public dataset: 17 Number of PII classes in extended dataset:… See the full description on the dataset page: https://huggingface.co/datasets/Ganasekhar/pii-masking-400k.

huggingface text-classification other 268 downloads
0.40
DataikuNLP/kiji-pii-training-data
pkg:data/DataikuNLP/kiji-pii-training-data

Kiji PII Detection Training Data Synthetic multilingual dataset for training PII (Personally Identifiable Information) detection models with token-level entity annotations and coreference resolution. Dataset Summary: Samples: 51,495 (train: 46,345, test: 5,150); Languages: 6 (English, Danish, Dutch, French, Spanish, German); Countries: 20; PII entity types: 26; Total entity annotations: 397,441 (avg 7.7 per sample); Coreference clusters: 0 (0% of samples)… See the full description on the dataset page: https://huggingface.co/datasets/DataikuNLP/kiji-pii-training-data.

huggingface token-classification apache-2.0 257 downloads
0.40
temsa/OpenMed-Irish-CorePII-TrainMix-v1
pkg:data/temsa/OpenMed-Irish-CorePII-TrainMix-v1

OpenMed Irish Core PII Train Mix v1 Composite token-classification training mix used to fine-tune temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v1. This repo is the training dataset, not the model itself. What A Row Looks Like Each row uses a fixed schema so the Hugging Face dataset viewer and datasets.load_dataset() can read it directly: id: row id inside the split text: reconstructed text string tokens: tokenized text labels: BIO labels aligned to tokens language:… See the full description on the dataset page: https://huggingface.co/datasets/temsa/OpenMed-Irish-CorePII-TrainMix-v1.

huggingface token-classification cc-by-4.0 236 downloads
0.40
AdamiTitus/pii-masking-300k
pkg:data/AdamiTitus/pii-masking-300k

Purpose and Features 🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Key facts: OpenPII-220k text entries have 27 PII classes (types of sensitive data), targeting 749 discussion subjects / use cases split across education, health, and psychology. FinPII contains an additional ~20 types tailored to… See the full description on the dataset page: https://huggingface.co/datasets/AdamiTitus/pii-masking-300k.

huggingface text-classification other 231 downloads
0.40
saad-kw-almutairi/pii-masking-300k
pkg:data/saad-kw-almutairi/pii-masking-300k

Purpose and Features 🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Key facts: OpenPII-220k text entries have 27 PII classes (types of sensitive data), targeting 749 discussion subjects / use cases split across education, health, and psychology. FinPII contains an additional ~20 types tailored to… See the full description on the dataset page: https://huggingface.co/datasets/saad-kw-almutairi/pii-masking-300k.

huggingface text-classification other 210 downloads
0.40
tursunait/roberta-pii-synth
pkg:data/tursunait/roberta-pii-synth

Synthetic PII Detection Dataset (RoBERTa-PII-Synth) A large-scale, fully synthetic dataset for training token-classification models to detect Personally Identifiable Information (PII) in realistic text. This dataset was built using an enhanced synthetic generation pipeline, designed to better capture the linguistic and formatting variability of real-world user text. All samples are fully artificial — no real people or identifiers appear anywhere. 📘 Dataset Summary… See the full description on the dataset page: https://huggingface.co/datasets/tursunait/roberta-pii-synth.

huggingface token-classification mit 188 downloads
0.40
anony-mouse123/enron_canary
pkg:data/anony-mouse123/enron_canary

CanaryBench-Enron Frequency-aware canary injection benchmark for auditing memorization in finetuned language models, built on the Enron email corpus. Dataset Description This dataset is part of CanaryBench, a benchmark for evaluating memorization in finetuned language models across repetition tiers and privacy regimes. Frequency tiers: 1×, 10×, 50× Domain: Email (Enron corpus) Member canaries: 770 Reference canaries: 1000 Files… See the full description on the dataset page: https://huggingface.co/datasets/anony-mouse123/enron_canary.

huggingface text-generation cc-by-4.0 187 downloads
0.40
DomainShield/InternalPiiDataset
pkg:data/DomainShield/InternalPiiDataset

Internal PII Benchmark A synthetic dataset for training and evaluating models on the detection of domain-specific PII — organization-internal identifiers that conventional PII systems fail to recognize. Unlike traditional PII (names, emails, phone numbers), this dataset targets terms such as internal team names, restricted locations, communication channels, infrastructure labels, and operational procedures (e.g., gamma squad, secure chamber, inner route). These terms are… See the full description on the dataset page: https://huggingface.co/datasets/DomainShield/InternalPiiDataset.

huggingface token-classification mit 177 downloads
0.40
ai4privacy/pii-masking-health-phi-preview
pkg:data/ai4privacy/pii-masking-health-phi-preview

PII Masking Personal Health & Medical Information (PHI) - Preview 50 sample entries from the PII-Masking-2M European release by AI4Privacy. Source text and PII values are redacted in this preview. Contact us for full access. Label Distribution Language Distribution European Coverage Full Dataset… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-health-phi-preview.

huggingface token-classification cc-by-4.0 167 downloads
0.40
ai4privacy/openpii-masking-nano-1k
pkg:data/ai4privacy/openpii-masking-nano-1k

OpenPII-Masking-Nano-1K - The Fast PII Detection Benchmark A compact, provider-agnostic benchmark for quick PII detection evaluation. The little sibling of OpenPII-Masking-Mini-10K. Same methodology, same 23 languages, same 19 entity types - just 1K samples for rapid iteration, CI/CD pipelines, and quick provider comparisons.… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/openpii-masking-nano-1k.

huggingface token-classification cc-by-4.0 165 downloads
0.40
enosislabs/matex-privacy-sentinel-dataset
pkg:data/enosislabs/matex-privacy-sentinel-dataset

MaTE X Privacy Sentinel Dataset Synthetic token-classification dataset for training a local privacy/security filter based on OpenAI Privacy Filter. Purpose This dataset teaches a local filter to detect and redact sensitive spans in developer workflows before context is sent to external LLMs. Target domains include: .env files terminal logs stack traces git diffs GitHub issues and PR comments agent traces tool outputs workspace memory auth, database, cloud and payment… See the full description on the dataset page: https://huggingface.co/datasets/enosislabs/matex-privacy-sentinel-dataset.

huggingface token-classification apache-2.0 153 downloads
0.40
joneauxedgar/pasteproof-pii-dataset-v2
pkg:data/joneauxedgar/pasteproof-pii-dataset-v2

PasteProof PII Dataset v2 Improved synthetic dataset for training PII detection models. What's New in v2 Dynamic templates: Templates generated on-the-fly with random variation More format variations: Each entity type has many format options Hard negatives: 10% of samples are tricky non-PII that looks like PII Variable key names: apiKey, api_key, API_KEY, etc. More entity generators: Many more API key formats, card types, etc. Entity Types (27) Financial:… See the full description on the dataset page: https://huggingface.co/datasets/joneauxedgar/pasteproof-pii-dataset-v2.

huggingface token-classification mit 151 downloads
0.40
mindbomber/aana-peer-review-evidence-pack
pkg:data/mindbomber/aana-peer-review-evidence-pack

AANA Peer Review Evidence Pack This dataset packages the current public evidence for AANA as an architecture for making agents more auditable, safer, more grounded, and more controllable. The claim boundary is intentionally narrow: AANA is production-candidate as an audit/control/verification/correction layer. AANA is not yet proven as a raw agent-performance engine. Results here are measured held-out or validation artifacts, not official leaderboard proof unless a benchmark… See the full description on the dataset page: https://huggingface.co/datasets/mindbomber/aana-peer-review-evidence-pack.

huggingface mit 135 downloads
0.40
RedactionBench/RedactionBench
pkg:data/RedactionBench/RedactionBench

Dataset Card for RedactionBench RedactionBench is an evaluation-only benchmark for character-level redaction across eleven document categories. Each of the 200 documents is manually-annotated with character spans that are either mandatory (must redact) or contextual. RedactionBench mixes 101 real-world documents manually sourced from the public web (transcribed, augmented) with 99 synthetic documents authored to fill categories where synthetic data is more appropriate. The above is… See the full description on the dataset page: https://huggingface.co/datasets/RedactionBench/RedactionBench.

huggingface token-classification cc-by-4.0 132 downloads
0.40
scanpatch/pii-ner-corpus-synthetic-controlled
pkg:data/scanpatch/pii-ner-corpus-synthetic-controlled

PII NER Corpus - Synthetic Controlled A controlled synthetic dataset for training Named Entity Recognition models to detect Personally Identifiable Information (PII) in Ukrainian and Russian text. This dataset was generated using a controlled pipeline with human-verified annotation guidelines. The text samples are based on real-world document patterns and annotated using Claude Sonnet 4 with strict quality controls. Dataset Description This dataset contains text… See the full description on the dataset page: https://huggingface.co/datasets/scanpatch/pii-ner-corpus-synthetic-controlled.

huggingface token-classification mit 131 downloads
0.40
compliancemas/ComplianceMAS-Bench
pkg:data/compliancemas/ComplianceMAS-Bench

ComplianceMAS-Bench Dataset Description ComplianceMAS-Bench is the first systematic benchmark for evaluating compliance behaviour in multi-agent memory systems. It comprises 269 scenarios spanning 5 compliance failure-mode categories and 4 regulated domains, grounded in HIPAA and GDPR requirements. Paper: ComplianceMAS: A Systematic Benchmark for Evaluating Compliance Behaviour in Multi-Agent Memory Systems (NeurIPS 2025 submission) Repository:… See the full description on the dataset page: https://huggingface.co/datasets/compliancemas/ComplianceMAS-Bench.

huggingface text-classification apache-2.0 129 downloads
0.40
aniket-curlscape/pii-masking-english
pkg:data/aniket-curlscape/pii-masking-english

Important This repository contains the English-only subset of the Ai4Privacy PII-Masking-300k Dataset. The dataset is curated to provide English texts only, while retaining the structure, labeling schema, and licensing of the original dataset. Licensing Academic use is encouraged with proper citation provided it follows similar license terms*. Commercial entities should contact us at licensing@ai4privacy.com for licensing inquiries and additional data access.* Terms… See the full description on the dataset page: https://huggingface.co/datasets/aniket-curlscape/pii-masking-english.

huggingface text-classification other 124 downloads
0.40
Ari-S-123/pii-detection-english-consolidated
pkg:data/Ari-S-123/pii-detection-english-consolidated

PII Detection Combined Dataset Combined dataset for PII (Personally Identifiable Information) detection, merging the ai4privacy English-only subset with challenging synthetic examples, generated and semantically validated with different LLMs, that target NER failure modes. Class labels were also consolidated to prevent label fragmentation. Dataset Description This dataset combines two sources: ai4privacy/open-pii-masking-500k (English subset): 120,533 train / 30,160… See the full description on the dataset page: https://huggingface.co/datasets/Ari-S-123/pii-detection-english-consolidated.

huggingface token-classification mit 124 downloads
0.40
vkatg/streaming-phi-deidentification-benchmark
pkg:data/vkatg/streaming-phi-deidentification-benchmark

Streaming PHI De-Identification Benchmark Most PHI de-identification benchmarks evaluate a single document in isolation. That is not how clinical data actually moves. A patient's name appears in a clinical note, then in an ASR transcript ten minutes later, then in imaging metadata an hour after that. Each event looks low-risk on its own. The cumulative exposure across modalities is what creates re-identification risk. This dataset captures that. Every record is fully synthetic. It… See the full description on the dataset page: https://huggingface.co/datasets/vkatg/streaming-phi-deidentification-benchmark.

huggingface mit 118 downloads
0.40
ai4privacy/pii-masking-digital-pdi-preview
pkg:data/ai4privacy/pii-masking-digital-pdi-preview

PII Masking Personal Digital Information (PDI) Preview: 50 sample entries from the PII-Masking-2M European release by AI4Privacy. Source text and PII values are redacted in this preview. Contact us for full access. Sections: Label Distribution; Language Distribution; European Coverage; Full Dataset: The… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-digital-pdi-preview.

huggingface token-classification cc-by-4.0 114 downloads
0.40
aniket-curlscape/pii-masking-english-100
pkg:data/aniket-curlscape/pii-masking-english-100

Important This repository contains the English-only subset of the Ai4Privacy PII-Masking-300k Dataset. The dataset is curated to provide English texts only, while retaining the structure, labeling schema, and licensing of the original dataset. Licensing Academic use is encouraged with proper citation provided it follows similar license terms*. Commercial entities should contact us at licensing@ai4privacy.com for licensing inquiries and additional data access.* Terms… See the full description on the dataset page: https://huggingface.co/datasets/aniket-curlscape/pii-masking-english-100.

huggingface text-classification other 113 downloads
0.40
ai4privacy/pii-masking-financial-pfi-preview
pkg:data/ai4privacy/pii-masking-financial-pfi-preview

PII Masking Personal Financial Information (PFI) Preview: 50 sample entries from the PII-Masking-2M European release by AI4Privacy. Source text and PII values are redacted in this preview. Contact us for full access. Sections: Label Distribution; Language Distribution; European Coverage; Full Dataset: The… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-financial-pfi-preview.

huggingface token-classification cc-by-4.0 111 downloads
0.40
aniket-curlscape/pii-masking-english-1k
pkg:data/aniket-curlscape/pii-masking-english-1k

Important This repository contains the English-only subset of the Ai4Privacy PII-Masking-300k Dataset. The dataset is curated to provide English texts only, while retaining the structure, labeling schema, and licensing of the original dataset. Licensing Academic use is encouraged with proper citation provided it follows similar license terms*. Commercial entities should contact us at licensing@ai4privacy.com for licensing inquiries and additional data access.* Terms… See the full description on the dataset page: https://huggingface.co/datasets/aniket-curlscape/pii-masking-english-1k.

huggingface text-classification other 108 downloads
0.40
VytautoDidziojoUniversitetas/NUS-LT-PII-corpus
pkg:data/VytautoDidziojoUniversitetas/NUS-LT-PII-corpus

NUS Lithuanian PII Corpus Description Lithuanian text annotated for personal information (PII), spanning three subject domains — administrative, scientific, and media — plus a stratified validation set. The corpus covers 24 entity types: 16 general categories (PER, LOC, ORG, …) and 8 GDPR special-category "sensitive" entities (REL, POL, SEX, GENDER, MAR, FAM, ETH, HEALTH). Dataset Summary Subsets: 4 (3 training categories + 1 validation set) Total records: 41… See the full description on the dataset page: https://huggingface.co/datasets/VytautoDidziojoUniversitetas/NUS-LT-PII-corpus.

huggingface token-classification openrail 108 downloads
0.40
ai4privacy/pii-masking-location-pli-preview
pkg:data/ai4privacy/pii-masking-location-pli-preview

PII Masking Personal Location & Travel Information (PLI) Preview: 50 sample entries from the PII-Masking-2M European release by AI4Privacy. Source text and PII values are redacted in this preview. Contact us for full access. Sections: Label Distribution; Language Distribution; European Coverage; Full Dataset… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-location-pli-preview.

huggingface token-classification cc-by-4.0 106 downloads
0.40
ai4privacy/openpii-masking-mini-10k
pkg:data/ai4privacy/openpii-masking-mini-10k

OpenPII Masking Mini 10K A compact, stratified subset of ai4privacy/pii-masking-openpii-1m, containing 10,000 samples for rapid experimentation, fine-tuning, and benchmarking of PII detection and masking models. Sampling Methodology Samples were selected using proportional stratified sampling by language: Target count per language = round(lang_proportion × 10,000) — proportional representation. Streaming + reservoir sampling collected 3× the target candidates per… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/openpii-masking-mini-10k.

huggingface token-classification cc-by-4.0 101 downloads
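The sampling methodology quoted in the ai4privacy/openpii-masking-mini-10k card (proportional targets per language, then streaming reservoir sampling of 3× the target) can be sketched roughly as follows. This is an illustration only, not the authors' pipeline: the function names, the toy stream, and the language counts are assumptions, and the card excerpt is truncated before it says how the 3× candidates are reduced to the final sample, so that step is left out.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def proportional_targets(lang_counts: Dict[str, int], total: int) -> Dict[str, int]:
    """Target count per language = round(lang_proportion * total)."""
    n = sum(lang_counts.values())
    return {lang: round(total * count / n) for lang, count in lang_counts.items()}


def reservoir_by_language(
    rows: Iterable[Tuple[str, str]],
    capacity: Dict[str, int],
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Single-pass reservoir sampling (Algorithm R), one reservoir per language."""
    rng = random.Random(seed)
    seen: Dict[str, int] = defaultdict(int)
    reservoirs: Dict[str, List[str]] = defaultdict(list)
    for lang, text in rows:
        k = capacity.get(lang, 0)
        if k == 0:
            continue  # language not requested
        seen[lang] += 1
        if len(reservoirs[lang]) < k:
            reservoirs[lang].append(text)  # fill phase
        else:
            j = rng.randrange(seen[lang])  # uniform in [0, seen)
            if j < k:
                reservoirs[lang][j] = text  # replace with probability k / seen
    return dict(reservoirs)


# Toy stream with made-up counts: 600 English, 300 German, 100 French rows.
stream = [("en", f"en-{i}") for i in range(600)]
stream += [("de", f"de-{i}") for i in range(300)]
stream += [("fr", f"fr-{i}") for i in range(100)]

targets = proportional_targets({"en": 600, "de": 300, "fr": 100}, total=100)
# Oversample 3x the target, mirroring the "3x the target candidates" step.
candidates = reservoir_by_language(
    stream, {lang: 3 * t for lang, t in targets.items()}
)
```

Keeping one reservoir per language means the stream is read once and memory stays bounded by the candidate count rather than the size of the source dataset.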
0.40
temsa/OpenMed-Irish-PPSN-Eircode-Spec-v1
pkg:data/temsa/OpenMed-Irish-PPSN-Eircode-Spec-v1

OpenMed Irish PPSN Eircode Spec v1: focused synthetic token-classification dataset for Irish PPSN and Eircode detection. This repo contains synthetic training rows, not a fine-tuned model. What A Row Looks Like: Each row uses a fixed schema: id: row id inside the split; text: rendered text string; tokens: tokenized text; labels: BIO labels aligned to tokens; language: en or ga; source_dataset: generator identifier; source_domain: optional domain tag, empty in this release… See the full description on the dataset page: https://huggingface.co/datasets/temsa/OpenMed-Irish-PPSN-Eircode-Spec-v1.

huggingface token-classification apache-2.0 99 downloads
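The row schema listed for temsa/OpenMed-Irish-PPSN-Eircode-Spec-v1 pairs `tokens` with BIO `labels`. A minimal sketch of how such a pair can be decoded into entity spans is shown below. The label names (`B-PPSN`, `B-EIRCODE`, `I-EIRCODE`) and the example row are assumptions made for illustration; the card excerpt does not list the actual label set.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, List[str]]]:
    """Group BIO-labelled tokens into (entity_type, tokens) spans."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must be aligned one-to-one")
    spans: List[Tuple[str, List[str]]] = []
    current_type = None
    current_tokens: List[str] = []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A new entity begins; close any open one first.
            if current_type is not None:
                spans.append((current_type, current_tokens))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)  # continue the open entity
        else:
            # "O", or an I- tag with no matching open entity (treated as "O" here).
            if current_type is not None:
                spans.append((current_type, current_tokens))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, current_tokens))
    return spans


# Made-up row; the values and label names are hypothetical.
tokens = ["My", "PPSN", "is", "1234567TW", "and", "Eircode", "D02", "X285"]
labels = ["O", "O", "O", "B-PPSN", "O", "O", "B-EIRCODE", "I-EIRCODE"]
spans = bio_to_spans(tokens, labels)
```

Treating a stray `I-` tag as `O` is one of several common conventions; a stricter loader could raise an error instead.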
0.40
wan9yu/pii-bench-zh
pkg:data/wan9yu/pii-bench-zh

PII Bench ZH Chinese PII (Personally Identifiable Information) detection benchmark dataset. Two subsets covering formal and informal Chinese text, with character-level span annotations. This is the first open Chinese PII benchmark that covers locale-specific formats (phone, national ID, bank card, license plate, address) with precise offsets. Disclaimer: This dataset is 100% synthetic and intended solely for research and evaluation purposes. It does not contain any real… See the full description on the dataset page: https://huggingface.co/datasets/wan9yu/pii-bench-zh.

huggingface token-classification apache-2.0 99 downloads
0.40
shivaniachary123/pii-masking-400k
pkg:data/shivaniachary123/pii-masking-400k

Purpose and Features 🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. AI4Privacy Dataset Analytics 📊 Dataset Overview Total entries: 406,896 Total tokens: 20,564,179 Total PII tokens: 2,357,029 Number of PII classes in public dataset: 17 Number of PII classes in extended dataset:… See the full description on the dataset page: https://huggingface.co/datasets/shivaniachary123/pii-masking-400k.

huggingface text-classification other 94 downloads
0.40
shivaniachary123/pii-masking-300k
pkg:data/shivaniachary123/pii-masking-300k

Purpose and Features 🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Key facts: OpenPII-220k text entries have 27 PII classes (types of sensitive data), targeting 749 discussion subjects / use cases split across education, health, and psychology. FinPII contains an additional ~20 types tailored to… See the full description on the dataset page: https://huggingface.co/datasets/shivaniachary123/pii-masking-300k.

huggingface text-classification other 93 downloads
0.40
rungalileo/pii
pkg:data/rungalileo/pii

PII Detection Dataset This dataset is derived from gretelai/synthetic_pii_finance_multilingual, filtered to English samples with PII labels consolidated into 11 standardized categories for evaluating PII detection metrics. Dataset Creation import ast from datasets import load_dataset # Load source dataset and filter to English ds = load_dataset("gretelai/synthetic_pii_finance_multilingual", split="test") ds = ds.filter(lambda x: x["language"] == "English") # Map… See the full description on the dataset page: https://huggingface.co/datasets/rungalileo/pii.

huggingface token-classification apache-2.0 89 downloads
0.40
betterdataai/gliner-multilingual-ner-silver-v1
pkg:data/betterdataai/gliner-multilingual-ner-silver-v1

Betterdata Annotated Multilingual NER/PII Dataset Summary This dataset contains multilingual, annotated NER/PII spans across 13 languages with 60+ label classes spanning PII, PHI, PCI, and general entity types. It is designed to train and evaluate privacy-preserving NER models. Data Sources bloomberg_financial_news_annotated (data/augmented/bloomberg_financial_news_annotated.jsonl) c4_multilingual_annotated (data/augmented/c4_multilingual_annotated.jsonl)… See the full description on the dataset page: https://huggingface.co/datasets/betterdataai/gliner-multilingual-ner-silver-v1.

huggingface token-classification apache-2.0 88 downloads
0.40
ai4privacy/pii-masking-work-pwi-preview
pkg:data/ai4privacy/pii-masking-work-pwi-preview

PII Masking Personal Work & HR Information (PWI) Preview: 50 sample entries from the PII-Masking-2M European release by AI4Privacy. Source text and PII values are redacted in this preview. Contact us for full access. Sections: Label Distribution; Language Distribution; European Coverage; Full Dataset: The… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-work-pwi-preview.

huggingface token-classification cc-by-4.0 86 downloads
0.40
mukunda1729/pii-detection-fixtures
pkg:data/mukunda1729/pii-detection-fixtures

pii-detection-fixtures: 25 short text snippets labeled for PII (Personally Identifiable Information) and secrets. Designed as a small, hand-curated fixture set for testing PII redaction pipelines, agent guardrails, and LLM prompt sanitizers. All data is synthetic: no real people, real keys, or real accounts. PII / secret types covered (type: examples in this set): email: 3; phone: 2; ssn: 1; dob: 1; credit_card: 1; address: 1; name: 2; medical_record: 1; passport: 1… See the full description on the dataset page: https://huggingface.co/datasets/mukunda1729/pii-detection-fixtures.

huggingface mit 85 downloads
0.40
arthrod/gliner-opf-ptbr-pii-bench-v1
pkg:data/arthrod/gliner-opf-ptbr-pii-bench-v1

PT-BR PII Benchmark v1 — head-to-head Why this exists Open-sourced at the request of @arthrod (Arthur Souza Rodrigues) after a two-night sprint training and benchmarking these models on an AMD MI300X. The motivation: there's surprisingly little published head-to-head data comparing MoE-based PII detectors (openai/privacy-filter) against dense small-model approaches (GLiNER on mmBERT/ettin) on a real-world Portuguese task — and the trade-offs turned out to be sharp enough to be… See the full description on the dataset page: https://huggingface.co/datasets/arthrod/gliner-opf-ptbr-pii-bench-v1.

huggingface agpl-3.0 84 downloads
0.40
fuasfgauighsudghaughdoaughsdughdasughoadhg/pii-masking-300k
pkg:data/fuasfgauighsudghaughdoaughsdughdasughoadhg/pii-masking-300k

Purpose and Features 🌍 World's largest open dataset for privacy masking 🌎 The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs. Key facts: OpenPII-220k text entries have 27 PII classes (types of sensitive data), targeting 749 discussion subjects / use cases split across education, health, and psychology. FinPII contains an additional ~20 types tailored to… See the full description on the dataset page: https://huggingface.co/datasets/fuasfgauighsudghaughdoaughsdughdasughoadhg/pii-masking-300k.

huggingface text-classification other 82 downloads
0.40
NationalTutoringObservatory/MathEd-PII
pkg:data/NationalTutoringObservatory/MathEd-PII

Dataset Card for MathEd-PII Dataset Summary MathEd-PII is a dataset focused on de-identifying Personally Identifiable Information (PII) within mathematics education and tutoring transcripts. This dataset contains surrogate ground truth data generated from question-anchored, on-demand mathematics tutoring sessions, providing a valuable resource for training and evaluating PII detection and redaction models in educational contexts. Supported Tasks… See the full description on the dataset page: https://huggingface.co/datasets/NationalTutoringObservatory/MathEd-PII.

huggingface text-generation mit, cc-by-4.0 80 downloads
0.40
jayluxferro/llm-redactor-leak-benchmark
pkg:data/jayluxferro/llm-redactor-leak-benchmark

LLM-Redactor Leak Benchmark: a benchmark of 1,300 synthetic prompts with 4,014 ground-truth annotations spanning four workload classes, designed to evaluate privacy-preserving techniques for outbound LLM requests. Released alongside the paper "LLM-Redactor: An Empirical Evaluation of Eight Techniques for Privacy-Preserving LLM Requests" by Justice Owusu Agyemang, Jerry John Kponyo, Elliot Amponsah, Godfred Manu Addo Boakye, and Kwame Opuni-Boachie Obour Agyekum. arXiv:2604.12064… See the full description on the dataset page: https://huggingface.co/datasets/jayluxferro/llm-redactor-leak-benchmark.

huggingface token-classification mit 79 downloads
0.40
yusuf-said/turkish-privacy-filter-dataset
pkg:data/yusuf-said/turkish-privacy-filter-dataset

Turkish Privacy Filter Dataset Turkish Privacy Filter Dataset is a Turkish-language dataset for privacy filtering, personally identifiable information (PII) detection, data redaction, and token-classification research. The dataset contains synthetic and curated Turkish text examples with character-level privacy spans. It is designed to support the development and evaluation of models that detect sensitive information in Turkish text, including names, phone numbers, email addresses… See the full description on the dataset page: https://huggingface.co/datasets/yusuf-said/turkish-privacy-filter-dataset.

huggingface token-classification mit 72 downloads
0.40
LocalDoc/pii_ner_azerbaijani_extended
pkg:data/LocalDoc/pii_ner_azerbaijani_extended

PII NER Azerbaijani Extended Dataset: extended version of LocalDoc/pii_ner_azerbaijani, a synthetic Azerbaijani dataset for PII-aware Named Entity Recognition (token classification). This dataset combines three data generation strategies: (1) Template-based: original synthetic data + transliterated variants; (2) LLM-generated PII: natural sentences with realistic PII in diverse contexts; (3) LLM-generated hard negatives: sentences WITHOUT PII but with tricky look-alike words. Note: All… See the full description on the dataset page: https://huggingface.co/datasets/LocalDoc/pii_ner_azerbaijani_extended.

huggingface token-classification cc-by-4.0 70 downloads
0.40
careons/dutch-healthcare-pii-ner
pkg:data/careons/dutch-healthcare-pii-ner

Dutch Healthcare PII NER Dataset: a synthetic Dutch-language Named Entity Recognition (NER) dataset focused on Personally Identifiable Information (PII) detection and anonymization, with a strong emphasis on healthcare contexts in the Netherlands. This dataset is fully synthetic. All texts and entities were generated by AI and do not represent real individuals, organizations, or medical records. Overview (property: value): Language: Dutch (nl); Samples: 400… See the full description on the dataset page: https://huggingface.co/datasets/careons/dutch-healthcare-pii-ner.

huggingface token-classification mit 68 downloads
0.40
shivaniachary123/SPY
pkg:data/shivaniachary123/SPY

SPY: Enhancing Privacy with Synthetic PII Detection Dataset We proudly present the SPY Dataset, a novel synthetic dataset for the task of Personal Identifiable Information (PII) detection. This dataset highlights the importance of safeguarding PII in modern data processing and serves as a benchmark for advancing privacy-preserving technologies. Key Highlights Innovative Generation: We present a methodology for developing the SPY dataset and compare it to other… See the full description on the dataset page: https://huggingface.co/datasets/shivaniachary123/SPY.

huggingface token-classification cc-by-4.0 60 downloads
0.40
ai4privacy/pli-masking-100k
pkg:data/ai4privacy/pli-masking-100k

EPII Personal Location Information (PLI) Masking Preview. Dataset Overview: This dataset provides a preview (400 samples) of the EPII Personal Location Information (PLI) Masking Dataset, a specialized collection designed for identifying and masking sensitive personal location information within text data. This preview demonstrates the… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pli-masking-100k.

huggingface token-classification other 56 downloads
0.40
zachz/pii-detection-corpus
pkg:data/zachz/pii-detection-corpus

PII Detection Corpus: synthetic dataset of text samples containing labeled PII (Personally Identifiable Information) for testing and benchmarking PII detection/scrubbing tools. Fields: text: text sample containing PII; pii_type: category of PII (email, phone, ssn, credit_card, ip, dob, address, passport, api_key, name, iban); pii_value: the exact PII string in the text; start: character offset start; end: character offset end; context: surrounding context category (medical… See the full description on the dataset page: https://huggingface.co/datasets/zachz/pii-detection-corpus.

huggingface token-classification mit 56 downloads
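The fields listed for zachz/pii-detection-corpus (`text`, `pii_type`, `pii_value`, `start`, `end`) suggest a simple consistency check and a redaction step, sketched below. This assumes `end` is an exclusive character offset, as in Python slicing; the card excerpt does not state the convention. The example row and the helper names `span_matches` and `redact` are made up for illustration.

```python
from typing import TypedDict


class PiiRow(TypedDict):
    text: str
    pii_type: str
    pii_value: str
    start: int
    end: int  # assumed exclusive, i.e. text[start:end] == pii_value


def span_matches(row: PiiRow) -> bool:
    """True when the offsets select exactly the annotated PII string."""
    return row["text"][row["start"]:row["end"]] == row["pii_value"]


def redact(row: PiiRow) -> str:
    """Replace the annotated span with a placeholder such as [EMAIL]."""
    placeholder = f"[{row['pii_type'].upper()}]"
    return row["text"][: row["start"]] + placeholder + row["text"][row["end"]:]


# Made-up example row; offsets are derived from the text so they stay consistent.
text = "Contact me at jane@example.com today."
value = "jane@example.com"
start = text.index(value)
good = PiiRow(text=text, pii_type="email", pii_value=value,
              start=start, end=start + len(value))
# Same row with offsets shifted by one character, to show a failing check.
bad = PiiRow(text=text, pii_type="email", pii_value=value,
             start=start + 1, end=start + len(value) + 1)
```

Running `span_matches` over every row before benchmarking is a cheap way to catch off-by-one offsets or an inclusive-end convention early.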
0.40
shivaniachary123/Nemotron-PII
pkg:data/shivaniachary123/Nemotron-PII

Nemotron-PII: Synthesized Data for Privacy-Preserving AI Dataset Description Nemotron‑PII is a synthetic, persona‑grounded dataset for training and evaluating detection of Personally Identifiable Information (PII) and Protected Health Information (PHI) in text at production quality. It contains 100,000 English records across 50+ industries with span‑level annotations for 55+ PII/PHI categories, generated with NVIDIA NeMo Data Designer using synthetic personas grounded in… See the full description on the dataset page: https://huggingface.co/datasets/shivaniachary123/Nemotron-PII.

huggingface token-classification cc-by-4.0 55 downloads
0.40
BTX24/turkish-privacy-pii-ner
pkg:data/BTX24/turkish-privacy-pii-ner

Turkish Privacy PII NER Dataset Repository: BTX24/turkish-privacy-pii-ner Author: Boran Toktay License: Creative Commons Attribution 4.0 International (CC BY 4.0) English Turkish Privacy PII NER Dataset is a fully synthetic Turkish named entity recognition dataset for privacy-oriented span detection. It is designed for training and evaluating models that detect personally identifiable information (PII) in Turkish text. The dataset contains Turkish sentences with… See the full description on the dataset page: https://huggingface.co/datasets/BTX24/turkish-privacy-pii-ner.

huggingface token-classification cc-by-4.0 54 downloads
0.40
mukuls9971/address-benchmark-v1
pkg:data/mukuls9971/address-benchmark-v1

Indian Address Benchmark Dataset v1: mixed benchmark dataset for Indian-address tagging built from synthetic data plus public upstream datasets. Repository: dataset repo mukuls9971/address-benchmark-v1. Splits: train 26728; validation 6158; test 1410. Files: train.jsonl, validation.jsonl, test.jsonl, report.json. Notes: Generated and published by the pii-model-oss workflow. Upstream datasets used to assemble benchmark variants retain their own… See the full description on the dataset page: https://huggingface.co/datasets/mukuls9971/address-benchmark-v1.

huggingface token-classification mit 53 downloads
0.40
Keerthikl/Nemotron-PII
pkg:data/Keerthikl/Nemotron-PII

Nemotron-PII: Synthesized Data for Privacy-Preserving AI Dataset Description Nemotron‑PII is a synthetic, persona‑grounded dataset for training and evaluating detection of Personally Identifiable Information (PII) and Protected Health Information (PHI) in text at production quality. It contains 100,000 English records across 50+ industries with span‑level annotations for 55+ PII/PHI categories, generated with NVIDIA NeMo Data Designer using synthetic personas grounded in… See the full description on the dataset page: https://huggingface.co/datasets/Keerthikl/Nemotron-PII.

huggingface token-classification cc-by-4.0 51 downloads
0.40
cyw123/VHD11K
pkg:data/cyw123/VHD11K

T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition Chen Yeh*, You-Ming Chang*, Wei-Chen Chiu, Ning Yu Accepted to NeurIPS'24 Datasets and Benchmarks Track! Overview We propose a comprehensive and extensive harmful dataset, Visual Harmful Dataset 11K (VHD11K), consisting of 10,000 images and 1,000 videos, crawled from the Internet and generated by 4 generative models, across a total of 10 harmful categories covering a full spectrum… See the full description on the dataset page: https://huggingface.co/datasets/cyw123/VHD11K.

huggingface zero-shot-classification 38 downloads
0.40
GaborMadarasz/toxic_oscar_hu
pkg:data/GaborMadarasz/toxic_oscar_hu

Total tokens: 453,507,683. Quantiles of text length (in characters): minimum (0 %) 30; Q1 (25 %) 587; median (50 %) 2,984; Q3 (75 %) 8,344; maximum (100 %) 956,593. Number of texts longer than 13,000 characters: 47,846. ATTENTION! This dataset contains toxic content filtered from the Hungarian section of oscar-corpus/oscar. The filtering was performed using an offensive and toxic dictionary-based algorithm. The dataset contains sexual, offensive, racist… See the full description on the dataset page: https://huggingface.co/datasets/GaborMadarasz/toxic_oscar_hu.

huggingface 17 downloads
0.40
french-open-data/bases-statistiques-communale-departementale-et-regionale-de-la-delinquance-enregistree-par-la-po
pkg:data/french-open-data/bases-statistiques-communale-departementale-et-regionale-de-la-delinquance-enregistree-par-la-po

Bases statistiques communale, départementale et régionale de la délinquance enregistrée par la police et la gendarmerie nationales [!NOTE] This Hugging Face dataset is empty. This card only serves to reference the dataset Bases statistiques communale, départementale et régionale de la délinquance enregistrée par la police et la gendarmerie nationales, which is available at https://www.data.gouv.fr/datasets/621df2954fa5a3b5a023e23c… See the full description on the dataset page: https://huggingface.co/datasets/french-open-data/bases-statistiques-communale-departementale-et-regionale-de-la-delinquance-enregistree-par-la-po.

huggingface 7 downloads
0.40
french-open-data/toutes-les-enquetes-de-l-ined-depuis-les-annees-1950
pkg:data/french-open-data/toutes-les-enquetes-de-l-ined-depuis-les-annees-1950

Toutes les enquêtes de l'Ined depuis les années 1950 [!NOTE] This Hugging Face dataset is empty. This card only serves to reference the dataset Toutes les enquêtes de l'Ined depuis les années 1950, which is available at https://www.data.gouv.fr/datasets/53699443a3a729239d2043d9 Description: All Ined surveys conducted since its creation: choice of spouse, families and employers, violence against women, homeless people… See the full description on the dataset page: https://huggingface.co/datasets/french-open-data/toutes-les-enquetes-de-l-ined-depuis-les-annees-1950.

huggingface 7 downloads
0.40
french-open-data/indicateurs-annuels-de-la-victimation-et-du-sentiment-dinsecurite-issus-des-enquetes-cadre-de-vi
pkg:data/french-open-data/indicateurs-annuels-de-la-victimation-et-du-sentiment-dinsecurite-issus-des-enquetes-cadre-de-vi

Indicateurs annuels de la victimation et du sentiment d’insécurité issus des enquêtes Cadre de Vie et Sécurité [!NOTE] This Hugging Face dataset is empty. This card only serves to reference the dataset Indicateurs annuels de la victimation et du sentiment d’insécurité issus des enquêtes Cadre de Vie et Sécurité, which is available at https://www.data.gouv.fr/datasets/62029493ff5c2e6d510b42b6 Description: View the detailed results on… See the full description on the dataset page: https://huggingface.co/datasets/french-open-data/indicateurs-annuels-de-la-victimation-et-du-sentiment-dinsecurite-issus-des-enquetes-cadre-de-vi.

huggingface 6 downloads
0.35
appledora/DANGA
pkg:data/appledora/DANGA

[!WARNING] Content Warning: This dataset contains violent, hateful, and severely offensive language in Bengali, including communal slurs, dehumanizing rhetoric, threats, and incitement to violence targeting religious, ethnic, and cultural communities. It is intended solely for research purposes (hate speech detection, content moderation, NLP). Do not use this dataset to generate, promote, or amplify harmful content. BanDANGA: A Bangla Dataset on Aggressive Narratives and… See the full description on the dataset page: https://huggingface.co/datasets/appledora/DANGA.

huggingface text-classification cc-by-sa-4.0 46 downloads
0.35
Jony7chu/HarmfulQA
pkg:data/Jony7chu/HarmfulQA

HarmfulQA The preliminary version (gated access) has been released. The preliminary version (gated access) will be released before September 8, 2025. ⚠️ Warning: This dataset contains harmful, offensive, or otherwise unsafe question–answer pairs. Access is gated – you must request access, provide institutional credentials, and agree to the Data Use Agreement (DUA) before downloading. Dataset Summary HarmfulQA is a dataset of 50 harmful questions, each paired with: A… See the full description on the dataset page: https://huggingface.co/datasets/Jony7chu/HarmfulQA.

huggingface question-answering other 22 downloads
0.35
TrustAIRLab/HarmfulQA
pkg:data/TrustAIRLab/HarmfulQA

HarmfulQA The preliminary version (gated access) has been released. The preliminary version (gated access) will be released before September 8, 2025. ⚠️ Warning: This dataset contains harmful, offensive, or otherwise unsafe question–answer pairs. Access is gated – you must request access, provide institutional credentials, and agree to the Data Use Agreement (DUA) before downloading. Dataset Summary HarmfulQA is a dataset of 50 harmful questions, each paired with: A… See the full description on the dataset page: https://huggingface.co/datasets/TrustAIRLab/HarmfulQA.

huggingface question-answering other 17 downloads
0.35
TrustAIRLab/JailbreakQR
pkg:data/TrustAIRLab/JailbreakQR

JailbreakQR The preliminary version (gated access) has been released. The preliminary version (gated access) will be released before September 8, 2025. ⚠️ Warning: This dataset contains harmful, offensive, or otherwise unsafe question–answer pairs. Access is gated – you must request access, provide an institutional email address or ORCID, and agree to the Data Use Agreement (DUA) before downloading. Dataset Summary JailbreakQR is a dataset of 400 pairs of jailbreak prompts… See the full description on the dataset page: https://huggingface.co/datasets/TrustAIRLab/JailbreakQR.

huggingface question-answering other 16 회 다운로드
0.35
Jony7chu/JailbreakQR
pkg:data/Jony7chu/JailbreakQR

JailbreakQR The preliminary version (gated access) has been released. The preliminary version (gated access) will be released before September 8, 2025. ⚠️ Warning: This dataset contains harmful, offensive, or otherwise unsafe question–answer pairs. Access is gated – you must request access, provide an institutional email address or ORCID, and agree to the Data Use Agreement (DUA) before downloading. Dataset Summary JailbreakQR is a dataset of 400 pairs of jailbreak… See the full description on the dataset page: https://huggingface.co/datasets/Jony7chu/JailbreakQR.

huggingface question-answering other 15 downloads
0.35
gus-mxx/A1-Machine-Learning-Data-Challenge
pkg:data/gus-mxx/A1-Machine-Learning-Data-Challenge

Dataset Card for Violence Detection Dataset This dataset was created in February 2026 for the Machine Learning Data Challenge Assignment from the course Unboxing the Algorithm at Erasmus University Rotterdam. The goal of the assignment is to select a problem that is societally relevant, hence we decided to train an ML model to identify violent imagery, with the aim that individuals involved in content moderation would be less exposed to this kind of harmful content that is… See the full description on the dataset page: https://huggingface.co/datasets/gus-mxx/A1-Machine-Learning-Data-Challenge.

huggingface image-classification other 13 downloads
0.30
denny3388/VHD11K
pkg:data/denny3388/VHD11K

T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition Chen Yeh*, You-Ming Chang*, Wei-Chen Chiu, Ning Yu Accepted to NeurIPS'24 Datasets and Benchmarks Track! Overview We propose a comprehensive and extensive harmful dataset, Visual Harmful Dataset 11K (VHD11K), consisting of 10,000 images and 1,000 videos, crawled from the Internet and generated by 4 generative models, across a total of 10 harmful categories covering a full spectrum… See the full description on the dataset page: https://huggingface.co/datasets/denny3388/VHD11K.

huggingface zero-shot-classification 109 downloads
0.30
byroneverson/abliterate-refusal
pkg:data/byroneverson/abliterate-refusal

Dataset for abliterating refusal in large language models Contains "harmful" prompts where "target" field is true, and "harmless" prompts where false. Credit: https://github.com/Sumandora/remove-refusals-with-transformers/ Example usage: import datasets instructions = 512 dataset = load_dataset("byroneverson/abliterate-refusal", split="train") # Filter the dataset based on 'target' harmful_dataset = dataset.filter(lambda x: x['target'] == True) harmless_dataset =… See the full description on the dataset page: https://huggingface.co/datasets/byroneverson/abliterate-refusal.

huggingface feature-extraction 81 downloads
0.30
lenML/abliterate-refusal-cn
pkg:data/lenML/abliterate-refusal-cn

I translated this dataset into Chinese with a local model, to reduce the damage to an LLM's Chinese ability when using the "abliterator" script. Dataset for abliterating refusal in large language models Contains "harmful" prompts where "is_harmful" field is true, and "harmless" prompts where false. Credit: https://github.com/Sumandora/remove-refusals-with-transformers/ Source repo: https://huggingface.co/datasets/byroneverson/abliterate-refusal Example usage: import datasets instructions = 512 dataset = load_dataset("lenML/abliterate-refusal-cn"… See the full description on the dataset page: https://huggingface.co/datasets/lenML/abliterate-refusal-cn.

huggingface feature-extraction 72 downloads
0.30
appledora/DANGA-Adapted
pkg:data/appledora/DANGA-Adapted

[!WARNING] Content Warning: This dataset contains violent, hateful, and severely offensive language in Bengali, including communal slurs, dehumanizing rhetoric, threats, and incitement to violence targeting religious, ethnic, and cultural communities. It is intended solely for research purposes (hate speech detection, content moderation, NLP). Do not use this dataset to generate, promote, or amplify harmful content. [!NOTE] This dataset has been released as part of the Adaption Competition… See the full description on the dataset page: https://huggingface.co/datasets/appledora/DANGA-Adapted.

huggingface cc-by-nc-sa-4.0 46 downloads
0.30
ud-smart-city/fight-detection-video
pkg:data/ud-smart-city/fight-detection-video

Fight Dataset - 1,000+ videos This dataset contains 1,000 high-quality videos of simulated physical altercations recorded in controlled environments, captured from static and moving surveillance camera views at up to 1920×1080 resolution and 30 FPS. Designed for violence detection, action recognition, and public safety systems, this surveillance dataset includes rich metadata annotations enabling accurate camera fight analysis and training violence detection models. Get the data… See the full description on the dataset page: https://huggingface.co/datasets/ud-smart-city/fight-detection-video.

huggingface object-detection cc-by-nc-nd-4.0 37 downloads
0.30
abullard1/germeval-2025-harmful-content-detection-training-dataset
pkg:data/abullard1/germeval-2025-harmful-content-detection-training-dataset

GermEval 2025 Harmful Content Detection - Training Sets (Call to Action • Attacks on Democratic Basic Order • Violence) Author: Samuel Ruairí Bullard - University of Regensburg Models: Model Zoo (Gradio Space) Base model: LSX-UniWue/ModernGBERT_134M Competition: GermEval 2025 Shared Task Collection: GermEval 2025 Contribution Collection abullardUR@GermEval Shared Task 2025 Submission Dataset Summary This repository republishes the training splits used… See the full description on the dataset page: https://huggingface.co/datasets/abullard1/germeval-2025-harmful-content-detection-training-dataset.

huggingface text-classification gpl-3.0 32 downloads
0.30
istiakshihab/DANGA
pkg:data/istiakshihab/DANGA

👹 BanDANGA: A Bangla Dataset on Aggressive Narratives and Group-based Attacks 👹

huggingface text-classification cc-by-sa-4.0 31 downloads
0.30
SimpleSam/wajibika
pkg:data/SimpleSam/wajibika

Dataset Card for Wajibika Wajibika is an action recognition data set of social misconduct by government and other chosen officials collected from TikTok and contributions by everyone willing to contribute to change.

huggingface apache-2.0 26 downloads
0.30
wei258/HatefulIllusion_Dataset
pkg:data/wei258/HatefulIllusion_Dataset

[Disclaimer] This dataset contains harmful content and can only be used for research or educational purposes! Dataset Description This dataset is generated and used in the paper: Hate in Plain Sight: On the Risks of Moderating AI-Generated Hateful Illusions (ICCV 2025) It contains 2,160 (hateful) AI-generated optical illusions that hide three types of messages: digits: 10 messages, 300 AI-generated illusions hate slangs (hate speech): 23 messages, 690 AI-generated illusions hate… See the full description on the dataset page: https://huggingface.co/datasets/wei258/HatefulIllusion_Dataset.

huggingface mit 26 downloads
0.30
3iS/violence-and-conflict-events-in-colombia
pkg:data/3iS/violence-and-conflict-events-in-colombia

Humanitarian Dataset: Violence and IHL Violations in Colombia (2024) Overview This dataset contains records of violence and infractions of international humanitarian law (IHL) in Colombia during 2024. The dataset was compiled by OCHA Colombia from reports by key informants and news sources. The data has been structured and categorized according to IHL standards, including the extraction of the number of victims and events. Content The dataset includes the… See the full description on the dataset page: https://huggingface.co/datasets/3iS/violence-and-conflict-events-in-colombia.

huggingface text-classification mit 17 downloads
0.25
qualifire/safety-benchmark
pkg:data/qualifire/safety-benchmark

Safety Classification Dataset Dataset Summary This dataset is designed for multi-label classification of text inputs, identifying whether they contain safety-related concerns. Each sample is labeled with one or more of the following categories: Dangerous Content Harassment Sexually Explicit Information Hate Speech Safe This dataset contains 5000 samples. Labeling Rules If Safe = 0, at least one of the other labels (Dangerous Content, Harassment, Sexually… See the full description on the dataset page: https://huggingface.co/datasets/qualifire/safety-benchmark.

huggingface cc-by-nc-4.0 86 downloads
0.25
liyang-ict/FineHarm
pkg:data/liyang-ict/FineHarm

Dataset Card for FineHarm We construct FineHarm, a dataset consisting of 29K prompt-response pairs with fine-grained annotations towards harmfulness to provide reasonable supervision for token-level training. Dataset Details Dataset Sources Repository: ICTMCG/SCM Paper: NeurIPS 2025 ArXiv: 2506.09996 Demo: Project Page Intended Uses Moderation tool: FineHarm is intended to be used for content moderation, specifically for classifying harmful… See the full description on the dataset page: https://huggingface.co/datasets/liyang-ict/FineHarm.

huggingface text-classification mit 63 downloads
0.25
burkimbia/speech-dataset-clean
pkg:data/burkimbia/speech-dataset-clean

huggingface 49 downloads
0.25
open-llm-leaderboard/stabilityai__StableBeluga2-details
pkg:data/open-llm-leaderboard/stabilityai__StableBeluga2-details

Dataset Card for Evaluation run of stabilityai/StableBeluga2 Dataset automatically created during the evaluation run of model stabilityai/StableBeluga2 The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/stabilityai__StableBeluga2-details.

huggingface 49 downloads
0.25
open-llm-leaderboard/mistralai__Mistral-7B-v0.3-details
pkg:data/open-llm-leaderboard/mistralai__Mistral-7B-v0.3-details

Dataset Card for Evaluation run of mistralai/Mistral-7B-v0.3 Dataset automatically created during the evaluation run of model mistralai/Mistral-7B-v0.3 The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 3 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/mistralai__Mistral-7B-v0.3-details.

huggingface 49 downloads
0.25
open-llm-leaderboard/Qwen__Qwen2.5-7B-Instruct-details
pkg:data/open-llm-leaderboard/Qwen__Qwen2.5-7B-Instruct-details

Dataset Card for Evaluation run of Qwen/Qwen2.5-7B-Instruct Dataset automatically created during the evaluation run of model Qwen/Qwen2.5-7B-Instruct The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen2.5-7B-Instruct-details.

huggingface 49 downloads
0.25
open-llm-leaderboard/abacusai__Dracarys-72B-Instruct-details
pkg:data/open-llm-leaderboard/abacusai__Dracarys-72B-Instruct-details

Dataset Card for Evaluation run of abacusai/Dracarys-72B-Instruct Dataset automatically created during the evaluation run of model abacusai/Dracarys-72B-Instruct The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results.… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/abacusai__Dracarys-72B-Instruct-details.

huggingface 49 downloads
0.25
open-llm-leaderboard/DeepMount00__Qwen2-1.5B-Ita_v2-details
pkg:data/open-llm-leaderboard/DeepMount00__Qwen2-1.5B-Ita_v2-details

Dataset Card for Evaluation run of DeepMount00/Qwen2-1.5B-Ita_v2 Dataset automatically created during the evaluation run of model DeepMount00/Qwen2-1.5B-Ita_v2 The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/DeepMount00__Qwen2-1.5B-Ita_v2-details.

huggingface 49 downloads
0.25
simulaXrm/stroke-800-HTR
pkg:data/simulaXrm/stroke-800-HTR

huggingface 49 downloads
0.25
open-llm-leaderboard/Qwen__Qwen1.5-0.5B-Chat-details
pkg:data/open-llm-leaderboard/Qwen__Qwen1.5-0.5B-Chat-details

Dataset Card for Evaluation run of Qwen/Qwen1.5-0.5B-Chat Dataset automatically created during the evaluation run of model Qwen/Qwen1.5-0.5B-Chat The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen1.5-0.5B-Chat-details.

huggingface 49 downloads
0.25
open-llm-leaderboard/MaziyarPanahi__calme-2.1-rys-78b-details
pkg:data/open-llm-leaderboard/MaziyarPanahi__calme-2.1-rys-78b-details

Dataset Card for Evaluation run of MaziyarPanahi/calme-2.1-rys-78b Dataset automatically created during the evaluation run of model MaziyarPanahi/calme-2.1-rys-78b The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/MaziyarPanahi__calme-2.1-rys-78b-details.

huggingface 49 downloads
0.25
open-llm-leaderboard/internlm__internlm2_5-20b-chat-details
pkg:data/open-llm-leaderboard/internlm__internlm2_5-20b-chat-details

Dataset Card for Evaluation run of internlm/internlm2_5-20b-chat Dataset automatically created during the evaluation run of model internlm/internlm2_5-20b-chat The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/internlm__internlm2_5-20b-chat-details.

huggingface 49 downloads
0.25
open-llm-leaderboard/mistralai__Mistral-7B-Instruct-v0.3-details
pkg:data/open-llm-leaderboard/mistralai__Mistral-7B-Instruct-v0.3-details

Dataset Card for Evaluation run of mistralai/Mistral-7B-Instruct-v0.3 Dataset automatically created during the evaluation run of model mistralai/Mistral-7B-Instruct-v0.3 The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/mistralai__Mistral-7B-Instruct-v0.3-details.

huggingface 49 downloads
0.25
open-llm-leaderboard/Qwen__Qwen1.5-0.5B-details
pkg:data/open-llm-leaderboard/Qwen__Qwen1.5-0.5B-details

Dataset Card for Evaluation run of Qwen/Qwen1.5-0.5B Dataset automatically created during the evaluation run of model Qwen/Qwen1.5-0.5B The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen1.5-0.5B-details.

huggingface 49 downloads
0.25
open-llm-leaderboard/microsoft__Phi-3-medium-4k-instruct-details
pkg:data/open-llm-leaderboard/microsoft__Phi-3-medium-4k-instruct-details

Dataset Card for Evaluation run of microsoft/Phi-3-medium-4k-instruct Dataset automatically created during the evaluation run of model microsoft/Phi-3-medium-4k-instruct The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/microsoft__Phi-3-medium-4k-instruct-details.

huggingface 49 downloads
0.25
open-llm-leaderboard/Qwen__Qwen1.5-7B-Chat-details
pkg:data/open-llm-leaderboard/Qwen__Qwen1.5-7B-Chat-details

Dataset Card for Evaluation run of Qwen/Qwen1.5-7B-Chat Dataset automatically created during the evaluation run of model Qwen/Qwen1.5-7B-Chat The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen1.5-7B-Chat-details.

huggingface 49 downloads
0.25
justincnn/j2nebJktUD-erlnZ43
pkg:data/justincnn/j2nebJktUD-erlnZ43

huggingface 48 downloads
0.25
TheFinAI/flare-cd
pkg:data/TheFinAI/flare-cd

Dataset Card for "flare-cd" More Information needed

huggingface 48 downloads
0.25
open-llm-leaderboard/microsoft__phi-2-details
pkg:data/open-llm-leaderboard/microsoft__phi-2-details

Dataset Card for Evaluation run of microsoft/phi-2 Dataset automatically created during the evaluation run of model microsoft/phi-2 The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An additional configuration… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/microsoft__phi-2-details.

huggingface 48 downloads
0.25
viral-data-safety/sequence_with_tiers
pkg:data/viral-data-safety/sequence_with_tiers

huggingface 48 downloads
0.25
open-llm-leaderboard/Intel__neural-chat-7b-v3-2-details
pkg:data/open-llm-leaderboard/Intel__neural-chat-7b-v3-2-details

Dataset Card for Evaluation run of Intel/neural-chat-7b-v3-2 Dataset automatically created during the evaluation run of model Intel/neural-chat-7b-v3-2 The dataset is composed of 78 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 3 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/Intel__neural-chat-7b-v3-2-details.

huggingface 48 downloads
0.25
open-llm-leaderboard/BEE-spoke-data__smol_llama-220M-GQA-fineweb_edu-details
pkg:data/open-llm-leaderboard/BEE-spoke-data__smol_llama-220M-GQA-fineweb_edu-details

Dataset Card for Evaluation run of BEE-spoke-data/smol_llama-220M-GQA-fineweb_edu Dataset automatically created during the evaluation run of model BEE-spoke-data/smol_llama-220M-GQA-fineweb_edu The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/BEE-spoke-data__smol_llama-220M-GQA-fineweb_edu-details.

huggingface 48 downloads
0.25
anyantudre/moore-speech-bible
pkg:data/anyantudre/moore-speech-bible

Moore Speech Bible: A Curated Audio-Text Dataset for Mooré TTS and ASR The Moore Speech Bible dataset is a collection of aligned audio and text in Mooré, gathered from publicly available religious sources. This corpus is curated for research and academic purposes in low-resource speech and language processing, especially for text-to-speech (TTS) and automatic speech recognition (ASR) in the Mooré language (ISO 639-3: mos). Mooré remains under-represented in current speech… See the full description on the dataset page: https://huggingface.co/datasets/anyantudre/moore-speech-bible.

huggingface 48 downloads
0.25
kunli-cs/MA52_pyskl
pkg:data/kunli-cs/MA52_pyskl

Skeleton Data for Micro-Action 52 dataset Introduction This repository is designed specifically for Skeleton-based Micro-Action Recognition research. The Micro-Action-52 (MA-52) dataset is only to be used for non-commercial scientific purposes. Please note that the test set is withheld for competition purposes. You can evaluate your results by following the provided instructions. 28 keypoints extracted by OpenPose self.inward = [ (4, 3), (3, 2), (7, 6), (6, 5)… See the full description on the dataset page: https://huggingface.co/datasets/kunli-cs/MA52_pyskl.

huggingface 48 downloads
0.25
stukenov/sozkz-corpus-pretrain-gec-mix-v1
pkg:data/stukenov/sozkz-corpus-pretrain-gec-mix-v1

huggingface 48 downloads
0.25
fw407/vtab-1k_dtd
pkg:data/fw407/vtab-1k_dtd

huggingface 48 downloads
0.25
open-llm-leaderboard/migtissera__Llama-3-70B-Synthia-v3.5-details
pkg:data/open-llm-leaderboard/migtissera__Llama-3-70B-Synthia-v3.5-details

Dataset Card for Evaluation run of migtissera/Llama-3-70B-Synthia-v3.5 Dataset automatically created during the evaluation run of model migtissera/Llama-3-70B-Synthia-v3.5 The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/migtissera__Llama-3-70B-Synthia-v3.5-details.

huggingface 48 downloads
0.25
open-llm-leaderboard/alpindale__WizardLM-2-8x22B-details
pkg:data/open-llm-leaderboard/alpindale__WizardLM-2-8x22B-details

Dataset Card for Evaluation run of alpindale/WizardLM-2-8x22B Dataset automatically created during the evaluation run of model alpindale/WizardLM-2-8x22B The dataset is composed of 43 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/alpindale__WizardLM-2-8x22B-details.

huggingface 48 downloads
0.25
RISys-Lab/RedSage-Seed
pkg:data/RISys-Lab/RedSage-Seed

Dataset Card for RedSage-Seed "RedSage: A Cybersecurity Generalist LLM" (ICLR 2026) Authors: Naufal Suryanto1*, Muzammal Naseer1, Pengfei Li1, Syed Talal Wasim2, Jinhui Yi2, Juergen Gall2, Paolo Ceravolo3, Ernesto Damiani3 1Khalifa University, 2University of Bonn, 3University of Milan *Project Lead 🌐 Project Page | 🤖 Model Collection | 📊 Benchmark Collection | 📘 Data Collection Dataset Summary RedSage-Seed is a… See the full description on the dataset page: https://huggingface.co/datasets/RISys-Lab/RedSage-Seed.

huggingface 48 downloads
0.25
open-llm-leaderboard/THUDM__glm-4-9b-chat-details
pkg:data/open-llm-leaderboard/THUDM__glm-4-9b-chat-details

Dataset Card for Evaluation run of THUDM/glm-4-9b-chat Dataset automatically created during the evaluation run of model THUDM/glm-4-9b-chat The dataset is composed of 43 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 5 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/THUDM__glm-4-9b-chat-details.

huggingface 48 downloads
0.25
open-llm-leaderboard/AI-Sweden-Models__gpt-sw3-40b-details
pkg:data/open-llm-leaderboard/AI-Sweden-Models__gpt-sw3-40b-details

Dataset Card for Evaluation run of AI-Sweden-Models/gpt-sw3-40b Dataset automatically created during the evaluation run of model AI-Sweden-Models/gpt-sw3-40b The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/AI-Sweden-Models__gpt-sw3-40b-details.

huggingface 48 downloads
0.25
RhapsodyAI/UltraVL
pkg:data/RhapsodyAI/UltraVL

huggingface visual-question-answering 48 downloads
0.25
open-llm-leaderboard/Alibaba-NLP__gte-Qwen2-7B-instruct-details
pkg:data/open-llm-leaderboard/Alibaba-NLP__gte-Qwen2-7B-instruct-details

Dataset Card for Evaluation run of Alibaba-NLP/gte-Qwen2-7B-instruct Dataset automatically created during the evaluation run of model Alibaba-NLP/gte-Qwen2-7B-instruct The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/Alibaba-NLP__gte-Qwen2-7B-instruct-details.

huggingface 48 downloads
0.25
nyu-dice-lab/allenai_WildChat-1M-Full-meta-llama_Llama-3.1-8B-Instruct
pkg:data/nyu-dice-lab/allenai_WildChat-1M-Full-meta-llama_Llama-3.1-8B-Instruct

huggingface 48 downloads
0.25
chrisjay/crowd-speech-africa
pkg:data/chrisjay/crowd-speech-africa

huggingface 48 downloads
0.25
open-llm-leaderboard/dnhkng__RYS-XLarge-details
pkg:data/open-llm-leaderboard/dnhkng__RYS-XLarge-details

Dataset Card for Evaluation run of dnhkng/RYS-XLarge Dataset automatically created during the evaluation run of model dnhkng/RYS-XLarge The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/dnhkng__RYS-XLarge-details.

huggingface 48 downloads
0.25
open-llm-leaderboard/Qwen__Qwen1.5-7B-details
pkg:data/open-llm-leaderboard/Qwen__Qwen1.5-7B-details

Dataset Card for Evaluation run of Qwen/Qwen1.5-7B Dataset automatically created during the evaluation run of model Qwen/Qwen1.5-7B The dataset is composed of 81 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 10 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An additional configuration… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen1.5-7B-details.

huggingface 48 downloads
0.25
open-llm-leaderboard/BEE-spoke-data__smol_llama-220M-GQA-details
pkg:data/open-llm-leaderboard/BEE-spoke-data__smol_llama-220M-GQA-details

Dataset Card for Evaluation run of BEE-spoke-data/smol_llama-220M-GQA Dataset automatically created during the evaluation run of model BEE-spoke-data/smol_llama-220M-GQA The dataset is composed of 43 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/BEE-spoke-data__smol_llama-220M-GQA-details.

huggingface 48 downloads
0.25
open-llm-leaderboard/Changgil__K2S3-v0.1-details
pkg:data/open-llm-leaderboard/Changgil__K2S3-v0.1-details

Dataset Card for Evaluation run of Changgil/K2S3-v0.1 Dataset automatically created during the evaluation run of model Changgil/K2S3-v0.1 The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/Changgil__K2S3-v0.1-details.

huggingface 48 downloads
0.25
open-llm-leaderboard/meta-llama__Llama-3.1-70B-details
pkg:data/open-llm-leaderboard/meta-llama__Llama-3.1-70B-details

Dataset Card for Evaluation run of meta-llama/Llama-3.1-70B Dataset automatically created during the evaluation run of model meta-llama/Llama-3.1-70B The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/meta-llama__Llama-3.1-70B-details.

huggingface 48 downloads
0.25
open-llm-leaderboard/EleutherAI__gpt-j-6b-details
pkg:data/open-llm-leaderboard/EleutherAI__gpt-j-6b-details

Dataset Card for Evaluation run of EleutherAI/gpt-j-6b Dataset automatically created during the evaluation run of model EleutherAI/gpt-j-6b The dataset is composed of 111 configuration(s), each one corresponding to one of the evaluated task. The dataset has been created from 12 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run.The "train" split is always pointing to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/EleutherAI__gpt-j-6b-details.

huggingface 48 downloads
0.25
OpenSportsLab/soccernetpro-classification-skeleton
pkg:data/OpenSportsLab/soccernetpro-classification-skeleton

huggingface 48 downloads
0.25
open-llm-leaderboard/TIGER-Lab__AceCoder-Qwen2.5-Coder-7B-Ins-Rule-details
pkg:data/open-llm-leaderboard/TIGER-Lab__AceCoder-Qwen2.5-Coder-7B-Ins-Rule-details

Dataset Card for Evaluation run of TIGER-Lab/AceCoder-Qwen2.5-Coder-7B-Ins-Rule. Dataset automatically created during the evaluation run of model TIGER-Lab/AceCoder-Qwen2.5-Coder-7B-Ins-Rule. The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/TIGER-Lab__AceCoder-Qwen2.5-Coder-7B-Ins-Rule-details.

huggingface 48 downloads
0.25
wanglab/BioReasonCell
pkg:data/wanglab/BioReasonCell

huggingface 48 downloads
0.25
BierLee/stack-100gb-subset
pkg:data/BierLee/stack-100gb-subset

huggingface 48 downloads
0.25
BarryFutureman/dpsk-oc-traces
pkg:data/BarryFutureman/dpsk-oc-traces

huggingface 48 downloads
0.25
open-llm-leaderboard/DeepMount00__Qwen2-1.5B-Ita-details
pkg:data/open-llm-leaderboard/DeepMount00__Qwen2-1.5B-Ita-details

Dataset Card for Evaluation run of DeepMount00/Qwen2-1.5B-Ita. Dataset automatically created during the evaluation run of model DeepMount00/Qwen2-1.5B-Ita. The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results. An… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/DeepMount00__Qwen2-1.5B-Ita-details.

huggingface 48 downloads
0.25
siyah1/Malayalam-TTS-v2
pkg:data/siyah1/Malayalam-TTS-v2

huggingface 47 downloads
0.25
Zaynoid/querybot-sft
pkg:data/Zaynoid/querybot-sft

huggingface 47 downloads
0.25
open-llm-leaderboard/Intel__neural-chat-7b-v3-3-details
pkg:data/open-llm-leaderboard/Intel__neural-chat-7b-v3-3-details

Dataset Card for Evaluation run of Intel/neural-chat-7b-v3-3. Dataset automatically created during the evaluation run of model Intel/neural-chat-7b-v3-3. The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results. An… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/Intel__neural-chat-7b-v3-3-details.

huggingface 47 downloads
0.25
open-llm-leaderboard/google__gemma-7b-details
pkg:data/open-llm-leaderboard/google__gemma-7b-details

Dataset Card for Evaluation run of google/gemma-7b. Dataset automatically created during the evaluation run of model google/gemma-7b. The dataset is composed of 78 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 7 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results. An additional configuration… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/google__gemma-7b-details.

huggingface 47 downloads