When to Halt a Migration
400,000+ documents profiled, 92% scope reduction, migration halted—sometimes stopping is the right answer
The Problem
Why This Migration?
Pharmaceutical companies must maintain Trial Master Files (TMFs) for regulatory inspections. Over years of clinical trials, the client’s legacy P-Drive file share had accumulated TMF documents mixed with drafts, duplicates, and non-essential files, and it lacked the structure required for inspection readiness.
| Risk | Impact |
|---|---|
| Regulatory inspection | Disorganized TMF = critical audit findings |
| Document accessibility | Inspectors need fast, accurate retrieval |
| Compliance | GxP requirements demand complete audit trails |
| Operational burden | Post-migration cleanup consumes regulatory team capacity |
The Hidden Problem
The project was scoped without insight into the actual data. Initial estimates assumed that the folder structure reflected the content it held. It did not.
The Solution
Approach: Profile First, Validate Thoroughly
Traditional migration: estimate scope → migrate everything → clean up after.
Our approach: profile first → validate thoroughly → proceed only if safe.
WHAT WE EXPECTED          WHAT WE FOUND
────────────────          ──────────────
400,000 documents         Massive duplication
        ↓                         ↓
Profile and descope       Misfiled across studies
        ↓                         ↓
Migrate ~32,000           SME validation insufficient
        ↓                         ↓
Done                      ⛔ Risk too high → HALT
Data Processing Pipeline
┌─────────────────────────────────────────────────────────────┐
│  SOURCE: P-DRIVE                                            │
│  (400,000+ documents initial estimate)                      │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│  PHASE 1: DATA PROFILING                                    │
│  - Directory structure analysis                             │
│  - File type/format inventory                               │
│  - Statistical sampling for quality assessment              │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│  PHASE 2: SCOPE DETERMINATION                               │
│  - In-scope vs. out-of-scope classification                 │
│  - Duplicate detection (cross-CRO, cross-study)             │
│  - SME validation of borderline cases                       │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│  PHASE 3: ML-ASSISTED CLASSIFICATION                        │
│  - Document categorization against TMF Reference Model      │
│  - Confidence scoring for manual review prioritization      │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│  PHASE 4: DRY RUN VALIDATION                                │
│  - Migration to Veeva Vault eTMF Sandbox                    │
│  - Sample verification against source                       │
│  - Gap analysis and risk assessment                         │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│  ⛔ MIGRATION HALTED                                        │
│  - Critical gaps identified in dry run                      │
│  - Even SME-validated samples contained errors              │
│  - Risk deemed too high to proceed                          │
└─────────────────────────────────────────────────────────────┘
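
The "confidence scoring for manual review prioritization" step in Phase 3 can be pictured as a simple triage. What follows is a minimal sketch, not the project's classifier: the `Prediction` structure, the document IDs, the category labels, and the 0.90 threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """One classifier output for one document (hypothetical structure)."""
    doc_id: str
    tmf_category: str   # predicted TMF Reference Model category label
    confidence: float   # model score in [0, 1]


def triage(predictions, auto_accept_threshold=0.90):
    """Split predictions into an auto-accepted list and a manual-review queue.

    The review queue is sorted lowest confidence first, so reviewers see
    the most doubtful documents before the borderline ones.
    The default threshold is an assumption, not a project value.
    """
    accepted = [p for p in predictions if p.confidence >= auto_accept_threshold]
    review = sorted(
        (p for p in predictions if p.confidence < auto_accept_threshold),
        key=lambda p: p.confidence,
    )
    return accepted, review


# Invented example data.
predictions = [
    Prediction("doc-001", "Protocol", 0.97),
    Prediction("doc-002", "Informed Consent Form", 0.55),
    Prediction("doc-003", "Monitoring Visit Report", 0.81),
]
accepted, review = triage(predictions)
print([p.doc_id for p in accepted])  # ['doc-001']
print([p.doc_id for p in review])    # ['doc-002', 'doc-003']
```

A high score only says the model is confident, not that it is right. If the labels it learned from were misfiled, a confident prediction can still be wrong, which is why auto-accepted documents would still need spot checks.
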
What Profiling Revealed
| Finding | Impact |
|---|---|
| Duplicate documents | Same documents across CRO and client copies |
| Misfiled documents | Documents in wrong study folders |
| Non-TMF content | Personal checklists, draft notes, incomplete documents |
| Metadata gaps | Missing dates, versions, study assignments |
| Untrustworthy structure | File names and folders did not reflect actual content |
Critical insight: In certain folders, three weeks of sample review identified only one document that actually required migration.
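
The first profiling pass (directory structure analysis and file type inventory) can be approximated with a short script. This is a sketch under assumptions: the function name `inventory` and its grouping rules are illustrative, and the project's actual profiling ran inside a proprietary framework that is not shown here.

```python
from collections import Counter
from pathlib import Path


def inventory(root):
    """Count files by extension and by top-level folder under root."""
    root = Path(root)
    by_extension = Counter()
    by_top_folder = Counter()
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Normalise case so ".PDF" and ".pdf" are counted together.
        by_extension[path.suffix.lower() or "(none)"] += 1
        relative = path.relative_to(root)
        # Files directly under root are grouped under "(root)".
        top = relative.parts[0] if len(relative.parts) > 1 else "(root)"
        by_top_folder[top] += 1
    return by_extension, by_top_folder
```

Calling `inventory` on the share's root returns two counters that can be compared against the assumed scope. Counts like these describe only where files sit and what they are called, so they cannot detect misfiled content on their own.
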
Implementation
My Role
| Area | Activities |
|---|---|
| Technical leadership | Migration methodology, data profiling approach, quality criteria |
| Team coordination | Data scientists, ML engineers, SMEs |
| Stakeholder management | Client project leads, subject matter experts |
| Risk management | Scope control, timeline, quality assurance |
| Compliance | GxP validation, audit trail requirements |
Team Structure
- Client team: Project leads, subject matter experts
- Consulting SMEs: Scoping specialists
- Technical team: Data scientists, ML engineers
Technology Stack
| Component | Technology |
|---|---|
| Migration Framework | Proprietary migration framework |
| Target System | Veeva Vault eTMF |
| Classification | ML/AI algorithms for document categorization |
| Duplicate Detection | Content hashing + similarity scoring |
| Infrastructure | GxP (TIER1) compliant, locally deployed migration server |
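
To illustrate the "content hashing + similarity scoring" row, here is a minimal sketch using only the Python standard library. SHA-256 and `difflib.SequenceMatcher` are assumptions made for the example, since the algorithms actually used are not specified above, and the file names and contents are invented.

```python
import hashlib
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def content_hash(data: bytes) -> str:
    """SHA-256 digest of raw bytes; identical content gives identical digests."""
    return hashlib.sha256(data).hexdigest()


def exact_duplicates(documents):
    """Group document names by content hash; keep groups with 2+ members."""
    groups = defaultdict(list)
    for name, data in documents.items():
        groups[content_hash(data)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def near_duplicates(texts, threshold=0.9):
    """Return (name_a, name_b, ratio) for pairs at or above the threshold."""
    pairs = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(texts.items()), 2):
        ratio = SequenceMatcher(None, text_a, text_b).ratio()
        if ratio >= threshold:
            pairs.append((name_a, name_b, round(ratio, 2)))
    return pairs


# Invented example: a CRO copy, an identical client copy, and a draft.
documents = {
    "cro/protocol_v2.pdf": b"Protocol version 2, final.",
    "client/protocol_v2_copy.pdf": b"Protocol version 2, final.",
    "client/protocol_v2_draft.pdf": b"Protocol version 2, draft.",
}
texts = {name: data.decode() for name, data in documents.items()}

print(exact_duplicates(documents))
print(near_duplicates(texts, threshold=0.8))
```

Hashing catches byte-identical copies cheaply, while similarity scoring flags drafts and re-saved versions that hashing misses. The pairwise comparison is quadratic in the number of documents, so at the scale of a real file share it would need blocking or candidate pre-filtering.
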
The Halt Decision
Dry Run Results
The first dry-run migration to the Veeva Vault eTMF Sandbox revealed:
| Issue | Severity | Implication |
|---|---|---|
| SME-validated samples contained errors | Critical | Validation process itself was insufficient |
| Significant metadata gaps | High | Post-migration remediation would be massive |
| Classification mismatches | High | Documents would be filed incorrectly in Vault |
| Audit trail concerns | Critical | Regulatory inspection risk |
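
One way to reason about errors found in a validated sample is to turn the observed error rate into a range for the whole population. The sketch below uses a Wilson score interval as an illustrative technique; it is not necessarily how the project assessed risk. No sample sizes or error counts are reported above, so the figures in the example (12 errors in 200 sampled documents) are invented. Only the ~32,000 population figure comes from the scope numbers in this write-up.

```python
import math


def wilson_interval(errors, sample_size, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = errors / sample_size
    denom = 1 + z**2 / sample_size
    center = (p + z**2 / (2 * sample_size)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / sample_size + z**2 / (4 * sample_size**2)) / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


def projected_errors(errors, sample_size, population):
    """Scale the interval bounds to an estimated count in the population."""
    low, high = wilson_interval(errors, sample_size)
    return round(low * population), round(high * population)


# Hypothetical sample: 12 errors found in 200 reviewed documents.
low, high = wilson_interval(12, 200)
print(f"error rate between {low:.1%} and {high:.1%}")
print(projected_errors(12, 200, 32_000))
```

Even a small sample error rate implies a large absolute number of wrongly filed documents once it is scaled to tens of thousands, which is the intuition behind treating errors in validated samples as critical.
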
The Business Calculation
| Option | Risk | Cost |
|---|---|---|
| Proceed with migration | Regulatory findings, SOP violations, audit risk | High migration cost + massive remediation |
| Halt migration | Documents remain on P-Drive | Profiling investment (sunk) |
Decision: The risk of migrating unreliable data into a regulated system exceeded the benefit. We recommended halting.
The Safer Path Forward
- Keep documents on P-Drive (known state)
- Complete parallel CRO migrations (cleaner data sources)
- Perform TMF completeness check against Vault
- Identify specific missing documents from P-Drive
- Migrate only verified, needed documents on demand
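
The completeness check in the steps above amounts to a set difference: what the TMF should contain minus what is already in Vault, with the P-Drive consulted only for the gaps. This is a minimal sketch with hypothetical study IDs, artifact names, and paths. It does not call any Veeva API and assumes both inventories have already been exported as plain lists.

```python
def completeness_gaps(expected, in_vault):
    """Return expected (study, artifact) pairs that are absent from Vault."""
    return sorted(set(expected) - set(in_vault))


def pdrive_candidates(gaps, pdrive_index):
    """Map each gap to P-Drive paths that might fill it (may be empty)."""
    return {gap: pdrive_index.get(gap, []) for gap in gaps}


# Invented example inventories.
expected = [
    ("STUDY-001", "Protocol"),
    ("STUDY-001", "Investigator Brochure"),
    ("STUDY-002", "Protocol"),
]
in_vault = [("STUDY-001", "Protocol")]
pdrive_index = {("STUDY-002", "Protocol"): ["P:/StudyB/protocol_final.pdf"]}

gaps = completeness_gaps(expected, in_vault)
candidates = pdrive_candidates(gaps, pdrive_index)
for gap, paths in candidates.items():
    print(gap, "->", paths or "no candidate on P-Drive")
```

Starting from the gaps keeps the manual verification effort proportional to what is actually missing rather than to the size of the share.
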
Results
Scope Reduction
| Stage | Document Count | Reduction vs. Previous Stage |
|---|---|---|
| Initial estimate | 400,000+ | — |
| After initial profiling | ~180,000 | -55% |
| After duplicate removal | ~80,000 | -56% |
| After in-scope filtering | ~32,000 | -60% |
| Final decision | 0 migrated | Migration halted |
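
The percentages in the table are step-wise, each relative to the previous stage, while the 92% headline figure is cumulative against the initial estimate. The short calculation below reproduces both from the table's document counts, which are themselves approximations ("400,000+", "~").

```python
stages = [
    ("Initial estimate", 400_000),
    ("After initial profiling", 180_000),
    ("After duplicate removal", 80_000),
    ("After in-scope filtering", 32_000),
]


def reductions(stages):
    """Return (stage, step %, cumulative %) for every stage after the first."""
    initial = stages[0][1]
    rows = []
    for (_, previous), (name, count) in zip(stages, stages[1:]):
        step = round(100 * (previous - count) / previous)
        cumulative = round(100 * (initial - count) / initial)
        rows.append((name, step, cumulative))
    return rows


for name, step, cumulative in reductions(stages):
    print(f"{name}: -{step}% vs. previous stage, -{cumulative}% vs. initial")
# After initial profiling: -55% vs. previous stage, -55% vs. initial
# After duplicate removal: -56% vs. previous stage, -80% vs. initial
# After in-scope filtering: -60% vs. previous stage, -92% vs. initial
```
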
Business Impact
| Outcome | Value |
|---|---|
| Risk avoided | Regulatory inspection findings prevented |
| Cost avoided | Massive post-migration remediation |
| Compliance maintained | No SOP violations from migrating unverified content |
| Clarity delivered | True state of P-Drive data now understood |
Lessons Learned
- Never scope a migration without profiling the data. This project was scoped based on folder structure, not content reality.
- File names and folder structures are not trustworthy. Documents are routinely misfiled, mislabeled, and duplicated.
- SME validation is necessary but not sufficient. Even human-validated samples contained errors that only surfaced in system testing.
- ML inherits source data problems. If training data is mislabeled, the model learns to mislabel.
- Sometimes the best migration is no migration. The courage to halt is harder than the momentum to proceed.
- Dry runs are essential, not optional. Sandbox validation caught what profiling and SME review missed.
Impact
By halting the migration, the client avoided:
- Regulatory risk: Migrating mislabeled documents into a GxP-validated system would have created audit findings
- Remediation cost: Post-migration cleanup of ~32,000 documents would have consumed the regulatory team for months
- Operational chaos: Inspectors finding documents in wrong study folders during an audit
The profiling investment wasn’t wasted—it revealed the true state of the P-Drive and informed a safer path forward: complete parallel CRO migrations first, then pull only verified missing documents on demand.