When to Halt a Migration
400,000+ documents profiled, 92% scope reduction, migration halted - sometimes stopping is the right answer
- Client: mRNA biotech company
- Industry: Biotechnology / Pharmaceutical
- Period: April 2024 - October 2024
- Role: Track Lead
The Problem
Why This Migration?
Pharmaceutical companies must maintain Trial Master Files (TMFs) for regulatory inspections. The client’s legacy P-Drive file share had accumulated documents over years of clinical trials - mixed with drafts, duplicates, and non-essential files - lacking the structure required for inspection readiness.
| Risk | Impact |
|---|---|
| Regulatory inspection | Disorganized TMF = critical audit findings |
| Document accessibility | Inspectors need fast, accurate retrieval |
| Compliance | GxP (Good Practice) requirements demand complete audit trails |
| Operational burden | Post-migration cleanup consumes regulatory team capacity |
The Hidden Problem
The project was scoped without insight into the actual data. Initial estimates assumed that the folder structure reflected what the folders actually contained. It did not.
The Solution
Approach: Profile First, Validate Thoroughly
Traditional migration: estimate scope → migrate everything → clean up afterwards.
Our approach: profile first → validate thoroughly → proceed only if safe.
Data Processing Pipeline
What Profiling Revealed
| Finding | Impact |
|---|---|
| Duplicate documents | Same documents across CRO (Contract Research Organization) and client copies |
| Misfiled documents | Documents in wrong study folders |
| Non-TMF content | Personal checklists, draft notes, incomplete documents |
| Metadata gaps | Missing dates, versions, study assignments |
| Untrustworthy structure | File names and folders did not reflect actual content |
Critical insight: In certain folders, three weeks of sample review identified only one document that actually required migration.
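One of the findings listed above, metadata gaps, is straightforward to quantify during profiling. The following is a minimal sketch, not the project's actual profiling code: the field names in `REQUIRED_FIELDS` and the record layout are assumptions made for illustration, since the real TMF metadata model is not described here.

```python
# Hypothetical required fields; the actual metadata model is not given in this case study.
REQUIRED_FIELDS = ("document_date", "version", "study_id")


def profile_metadata_gaps(records):
    """Count, per required field, how many records have it missing or blank."""
    gaps = {field: 0 for field in REQUIRED_FIELDS}
    for record in records:
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            # Treat absent keys, None, and whitespace-only strings as gaps.
            if value is None or str(value).strip() == "":
                gaps[field] += 1
    return gaps


# Illustrative records only; not real project data.
sample = [
    {"document_date": "2021-03-04", "version": "1.0", "study_id": "STUDY-A"},
    {"document_date": "", "version": "2.0", "study_id": None},
    {"version": " ", "study_id": "STUDY-B"},
]
gap_counts = profile_metadata_gaps(sample)
print(gap_counts)
```

Running a count like this per folder before scoping gives an early signal of how much remediation a migration would require.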
Implementation
My Role
| Area | Activities |
|---|---|
| Technical leadership | Migration methodology, data profiling approach, quality criteria |
| Team coordination | Data scientists, ML engineers, SMEs |
| Stakeholder management | Client project leads, subject matter experts |
| Risk management | Scope control, timeline, quality assurance |
| Compliance | GxP validation, audit trail requirements |
Team Structure
- Client team: Project leads, subject matter experts (SMEs)
- Consulting SMEs: Scoping specialists
- Technical team: Data scientists, ML engineers
Technology Stack
| Component | Technology |
|---|---|
| Migration Framework | Proprietary migration framework |
| Target System | Veeva Vault eTMF (electronic Trial Master File) |
| Classification | ML/AI algorithms for document categorization |
| Duplicate Detection | Content hashing + similarity scoring |
| Infrastructure | GxP (TIER1) compliant, locally deployed migration server |
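The duplicate detection row combines two ideas: exact matching through content hashes and near-duplicate matching through a similarity score. The sketch below illustrates that combination using only the Python standard library. It is not the proprietary framework's implementation: the function names, the 0.9 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration.

```python
import hashlib
from difflib import SequenceMatcher


def content_hash(data: bytes) -> str:
    """SHA-256 digest of the raw content; identical content gives identical digests."""
    return hashlib.sha256(data).hexdigest()


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two extracted texts."""
    return SequenceMatcher(None, a, b).ratio()


def find_duplicates(docs, threshold=0.9):
    """docs maps path -> extracted text. Returns (exact_pairs, near_pairs).

    The threshold is a hypothetical value, not one taken from the project.
    """
    exact, near = [], []
    paths = sorted(docs)
    hashes = {p: content_hash(docs[p].encode("utf-8")) for p in paths}
    for i, p in enumerate(paths):
        for q in paths[i + 1:]:
            if hashes[p] == hashes[q]:
                exact.append((p, q))      # byte-identical content
            elif similarity(docs[p], docs[q]) >= threshold:
                near.append((p, q))       # probable draft or variant
    return exact, near


# Illustrative documents only; paths and text are invented.
docs = {
    "cro/protocol_v1.txt": "Clinical study protocol version 1.0 for site activation.",
    "client/protocol_copy.txt": "Clinical study protocol version 1.0 for site activation.",
    "client/protocol_v1_draft.txt": "Clinical study protocol version 1.0 for site activation (draft).",
    "client/checklist.txt": "Personal checklist: book travel, order lab kits.",
}
exact_pairs, near_pairs = find_duplicates(docs)
print(exact_pairs)
print(near_pairs)
```

The pairwise comparison is quadratic in the number of documents, so at a scale of hundreds of thousands of files one would typically group by hash first and compare only within candidate groups. Near-duplicate pairs are best treated as review candidates, not as automatic deletions.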
The Halt Decision
Dry Run Results
The first dry-run migration into the Veeva Vault eTMF sandbox revealed:
| Issue | Severity | Implication |
|---|---|---|
| SME-validated samples contained errors | Critical | Validation process itself was insufficient |
| Significant metadata gaps | High | Post-migration remediation would be massive |
| Classification mismatches | High | Documents would be filed incorrectly in Vault |
| Audit trail concerns | Critical | Regulatory inspection risk |
The Business Calculation
| Option | Risk | Cost |
|---|---|---|
| Proceed with migration | Regulatory findings, SOP (Standard Operating Procedure) violations, audit risk | High migration cost + massive remediation |
| Halt migration | Documents remain on P-Drive | Profiling investment (sunk) |
Decision: The risk of migrating unreliable data into a regulated system exceeded the benefit. We recommended halting.
The Safer Path Forward
- Keep documents on P-Drive (known state)
- Complete parallel CRO migrations (cleaner data sources)
- Perform TMF completeness check against Vault
- Identify specific missing documents from P-Drive
- Migrate only verified, needed documents on demand
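The last three steps above amount to a set difference followed by a lookup. The sketch below shows that logic under stated assumptions: the artifact identifiers, the `pdrive_index` mapping, and both function names are hypothetical, and nothing here calls the Veeva Vault API. The inputs are assumed to be exported lists prepared beforehand.

```python
def missing_from_vault(expected_ids, vault_ids):
    """Expected TMF artifacts that are not yet present in Vault."""
    return sorted(set(expected_ids) - set(vault_ids))


def on_demand_candidates(missing_ids, pdrive_index):
    """Split missing artifacts into those with a verified P-Drive copy and those without.

    pdrive_index maps artifact id -> list of verified P-Drive paths (hypothetical structure).
    """
    found = {a: pdrive_index[a] for a in missing_ids if pdrive_index.get(a)}
    unresolved = [a for a in missing_ids if not pdrive_index.get(a)]
    return found, unresolved


# Illustrative identifiers and paths only.
expected = ["protocol", "investigator_brochure", "monitoring_plan", "site_signature_log"]
in_vault = ["protocol", "monitoring_plan"]
pdrive_index = {"investigator_brochure": ["P:/StudyA/IB_v3_final.pdf"]}

missing = missing_from_vault(expected, in_vault)
found, unresolved = on_demand_candidates(missing, pdrive_index)
print(found)
print(unresolved)
```

Only the entries in `found` would be migrated, and only after verification; entries in `unresolved` need a separate search or a documented gap.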
Results
Scope Reduction
| Stage | Document Count | Reduction vs. previous stage |
|---|---|---|
| Initial estimate | 400,000+ | - |
| After initial profiling | ~180,000 | -55% |
| After duplicate removal | ~80,000 | -56% |
| After in-scope filtering | ~32,000 | -60% |
| Final decision | 0 migrated | Migration halted |
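The percentages in the table are relative to the preceding stage, while the headline 92% figure is cumulative from the initial estimate to the in-scope set. A short check of that arithmetic, using the rounded counts from the table (the source counts are approximate, so the results are approximate too):

```python
# Rounded document counts taken from the scope reduction table.
stages = [
    ("Initial estimate", 400_000),
    ("After initial profiling", 180_000),
    ("After duplicate removal", 80_000),
    ("After in-scope filtering", 32_000),
]


def stage_reductions(stages):
    """Percentage reduction of each stage relative to the stage before it."""
    rows = []
    for (_, previous), (name, count) in zip(stages, stages[1:]):
        rows.append((name, round(100 * (1 - count / previous))))
    return rows


def cumulative_reduction(stages):
    """Percentage reduction from the first stage to the last."""
    first, last = stages[0][1], stages[-1][1]
    return round(100 * (1 - last / first))


per_stage = stage_reductions(stages)
overall = cumulative_reduction(stages)
for name, pct in per_stage:
    print(f"{name}: -{pct}%")
print(f"Cumulative: -{overall}%")
```

The per-stage figures do not add up to the cumulative one because each applies to a smaller base: 0.45 × 0.444 × 0.40 ≈ 0.08 of the original estimate remains.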
Business Impact
| Outcome | Value |
|---|---|
| Risk avoided | Regulatory inspection findings prevented |
| Cost avoided | Massive post-migration remediation |
| Compliance maintained | No SOP violations from migrating unverified content |
| Clarity delivered | True state of P-Drive data now understood |
Lessons Learned
- Never scope a migration without profiling the data. This project was scoped based on folder structure, not content reality.
- File names and folder structures are not trustworthy. Documents are routinely misfiled, mislabeled, and duplicated.
- SME validation is necessary but not sufficient. Even human-validated samples contained errors that only surfaced in system testing.
- ML inherits source data problems. If training data is mislabeled, the model learns to mislabel.
- Sometimes the best migration is no migration. The courage to halt is harder than the momentum to proceed.
- Dry runs are essential, not optional. Sandbox validation caught what profiling and SME review missed.
Impact
By halting the migration, the client avoided:
- Regulatory risk: Migrating mislabeled documents into a GxP-validated system would have created audit findings
- Remediation cost: Post-migration cleanup of roughly 32,000 documents would have consumed the regulatory team for months
- Operational chaos: Inspectors finding documents in wrong study folders during an audit
The profiling investment wasn’t wasted - it revealed the true state of the P-Drive and informed a safer path forward: complete parallel CRO migrations first, then pull only verified missing documents on demand.