When to Halt a Migration

400,000+ documents profiled, 92% scope reduction, migration halted—sometimes stopping is the right answer

Data Migration | Pharma | Risk Management | Veeva
Client: mRNA biotech company
Period: April 2024 to October 2024
Role: Track Lead
Key Impact:
92% scope reduction (400,000+ → 32,000 documents) | Regulatory risk avoided by halting before production migration | Safer path forward identified through parallel CRO migrations
Veeva Vault eTMF | ML-assisted document classification | Content hashing / similarity scoring

The Problem

Why This Migration?

Pharmaceutical companies must maintain Trial Master Files (TMFs) for regulatory inspections. The client’s legacy P-Drive file share had accumulated documents over years of clinical trials—mixed with drafts, duplicates, and non-essential files—lacking the structure required for inspection readiness.

Risk                      Impact
────                      ──────
Regulatory inspection     Disorganized TMF = critical audit findings
Document accessibility    Inspectors need fast, accurate retrieval
Compliance                GxP requirements demand complete audit trails
Operational burden        Post-migration cleanup consumes regulatory team capacity

The Hidden Problem

The project was scoped without insight into the actual data. Initial estimates assumed that the folder structure reflected content reality. It did not.


The Solution

Approach: Profile First, Validate Thoroughly

Traditional migration: estimate scope → migrate everything → clean up after.

Our approach: profile first → validate thoroughly → proceed only if safe.

WHAT WE EXPECTED                    WHAT WE FOUND
────────────────                    ──────────────
400,000 documents                   Massive duplication
  ↓                                   ↓
Profile and descope                 Misfiled across studies
  ↓                                   ↓
Migrate ~32,000                     SME validation insufficient
  ↓                                   ↓
Done                                ⛔ Risk too high → HALT

Data Processing Pipeline

┌─────────────────────────────────────────────────────────────┐
│                    SOURCE: P-DRIVE                          │
│              (400,000+ documents initial estimate)          │
└─────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────┐
│              PHASE 1: DATA PROFILING                        │
│  - Directory structure analysis                             │
│  - File type/format inventory                               │
│  - Statistical sampling for quality assessment              │
└─────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────┐
│              PHASE 2: SCOPE DETERMINATION                   │
│  - In-scope vs. out-of-scope classification                 │
│  - Duplicate detection (cross-CRO, cross-study)             │
│  - SME validation of borderline cases                       │
└─────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────┐
│              PHASE 3: ML-ASSISTED CLASSIFICATION            │
│  - Document categorization against TMF Reference Model      │
│  - Confidence scoring for manual review prioritization      │
└─────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────┐
│              PHASE 4: DRY RUN VALIDATION                    │
│  - Migration to Veeva Vault eTMF Sandbox                    │
│  - Sample verification against source                       │
│  - Gap analysis and risk assessment                         │
└─────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────┐
│                    ⛔ MIGRATION HALTED                       │
│  - Critical gaps identified in dry run                      │
│  - Even SME-validated samples contained errors              │
│  - Risk deemed too high to proceed                          │
└─────────────────────────────────────────────────────────────┘
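As an illustration of the Phase 1 steps in the diagram above (directory structure analysis, file type inventory, statistical sampling), here is a minimal Python sketch. It is not the project's proprietary framework; the function name `profile_share`, the sample size, and the fixed seed are assumptions made for the example.

```python
import random
from collections import Counter
from pathlib import Path


def profile_share(root: Path, sample_size: int = 50, seed: int = 0):
    """Inventory a file share and draw a reproducible sample for manual review.

    Hypothetical helper for illustration only.
    """
    files = sorted(p for p in root.rglob("*") if p.is_file())

    # File type/format inventory: count files per lowercased extension.
    by_extension = Counter(p.suffix.lower() or "(none)" for p in files)

    # Directory structure analysis: count files per top-level folder.
    by_top_folder = Counter()
    for p in files:
        parts = p.relative_to(root).parts
        by_top_folder[parts[0] if len(parts) > 1 else "(root)"] += 1

    # Statistical sampling: a fixed seed lets reviewers reproduce the sample.
    rng = random.Random(seed)
    sample = rng.sample(files, min(sample_size, len(files)))
    return by_extension, by_top_folder, sample
```

Counts per extension and per folder only describe the shape of the share. The sample is what gets opened and read, since the file names and folders turned out not to reflect the content.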

What Profiling Revealed

Finding                    Impact
───────                    ──────
Duplicate documents        Same documents across CRO and client copies
Misfiled documents         Documents in wrong study folders
Non-TMF content            Personal checklists, draft notes, incomplete documents
Metadata gaps              Missing dates, versions, study assignments
Untrustworthy structure    File names and folders did not reflect actual content

Critical insight: During 3 weeks of sample review, certain folders yielded only 1 document that actually required migration.
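The duplicate finding came from content hashing combined with similarity scoring. The sketch below shows one way the two can fit together, assuming text content already extracted into memory: SHA-256 groups byte-identical copies, and `difflib.SequenceMatcher` scores the remaining distinct documents. The function names and the 0.9 threshold are illustrative assumptions, not the project's actual settings.

```python
import hashlib
from difflib import SequenceMatcher
from itertools import combinations


def content_hash(data: bytes) -> str:
    """Hash file content so identical bytes map to the same key."""
    return hashlib.sha256(data).hexdigest()


def find_duplicates(docs: dict[str, bytes], near_threshold: float = 0.9):
    """Return (exact_groups, near_pairs) for a mapping of path -> content.

    Hypothetical helper; near_threshold is an assumed cut-off.
    """
    # Exact duplicates: group paths by content hash.
    groups: dict[str, list[str]] = {}
    for path, data in docs.items():
        groups.setdefault(content_hash(data), []).append(path)
    exact = sorted(sorted(paths) for paths in groups.values() if len(paths) > 1)

    # Near duplicates: compare one representative per distinct hash.
    representatives = sorted(min(paths) for paths in groups.values())
    near = []
    for a, b in combinations(representatives, 2):
        text_a = docs[a].decode("utf-8", errors="ignore")
        text_b = docs[b].decode("utf-8", errors="ignore")
        score = SequenceMatcher(None, text_a, text_b).ratio()
        if score >= near_threshold:
            near.append((a, b, round(score, 3)))
    return exact, near
```

The pairwise comparison is quadratic in the number of distinct documents, so this form only suits small batches. At the scale of a full file share, an approximate technique such as MinHash would be the more realistic choice.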


Implementation

My Role

Area                      Activities
────                      ──────────
Technical leadership      Migration methodology, data profiling approach, quality criteria
Team coordination         Data scientists, ML engineers, SMEs
Stakeholder management    Client project leads, subject matter experts
Risk management           Scope control, timeline, quality assurance
Compliance                GxP validation, audit trail requirements

Team Structure

  • Client team: Project leads, subject matter experts
  • Consulting SMEs: Scoping specialists
  • Technical team: Data scientists, ML engineers

Technology Stack

Component              Technology
─────────              ──────────
Migration Framework    Proprietary migration framework
Target System          Veeva Vault eTMF
Classification         ML/AI algorithms for document categorization
Duplicate Detection    Content hashing + similarity scoring
Infrastructure         GxP (TIER1) compliant, locally deployed migration server
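The classification step produced confidence scores that were used to prioritize manual review. A minimal sketch of that kind of triage follows; the `Prediction` type, the `triage` function, and the 0.95 threshold are assumptions for illustration, and the labels stand in for TMF Reference Model categories.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    doc_id: str
    label: str         # predicted category (illustrative label)
    confidence: float  # model score in [0, 1]


def triage(predictions, spot_check_above: float = 0.95):
    """Split predictions into a spot-check bucket and a full-review queue.

    Hypothetical helper; the threshold is an assumed value.
    """
    # High-confidence predictions are only spot-checked, not trusted blindly.
    spot_check = [p for p in predictions if p.confidence >= spot_check_above]
    # Everything else is reviewed in full, least confident first.
    review = sorted(
        (p for p in predictions if p.confidence < spot_check_above),
        key=lambda p: p.confidence,
    )
    return spot_check, review
```

Sorting by confidence orders reviewer effort; it does not validate anything. A model trained on mislabeled source data can be confidently wrong, which is why the high-confidence bucket still gets spot checks.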

The Halt Decision

Dry Run Results

The first dry-run migration to the Veeva Vault eTMF Sandbox revealed:

Issue                                     Severity    Implication
─────                                     ────────    ───────────
SME-validated samples contained errors    Critical    Validation process itself was insufficient
Significant metadata gaps                 High        Post-migration remediation would be massive
Classification mismatches                 High        Documents would be filed incorrectly in Vault
Audit trail concerns                      Critical    Regulatory inspection risk

The Business Calculation

Option                    Risk                                               Cost
──────                    ────                                               ────
Proceed with migration    Regulatory findings, SOP violations, audit risk    High migration cost + massive remediation
Halt migration            Documents remain on P-Drive                        Profiling investment (sunk)

Decision: The risk of migrating unreliable data into a regulated system exceeded the benefit. We recommended halting.

The Safer Path Forward

  1. Keep documents on P-Drive (known state)
  2. Complete parallel CRO migrations (cleaner data sources)
  3. Perform TMF completeness check against Vault
  4. Identify specific missing documents from P-Drive
  5. Migrate only verified, needed documents on demand
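Steps 3 to 5 above amount to a set difference followed by a lookup. The sketch below illustrates the idea under simplifying assumptions: documents are identified by a hypothetical `(study, artifact)` key, the expected list and the Vault inventory are available as sets, and `p_drive_index` maps keys to P-Drive paths that have already been verified. None of these names come from the project or from the Veeva API.

```python
def plan_on_demand_migration(expected, in_vault, p_drive_index):
    """Work out which missing documents can be pulled from the P-Drive.

    expected:       set of (study, artifact) keys the TMF should contain
    in_vault:       set of keys already present after the CRO migrations
    p_drive_index:  dict mapping key -> list of verified P-Drive paths
    All three inputs are hypothetical structures for illustration.
    """
    # Completeness check: what the TMF should contain but the Vault lacks.
    missing = expected - in_vault
    # Only keys with a verified P-Drive source become migration candidates.
    to_migrate = {k: p_drive_index[k] for k in sorted(missing) if k in p_drive_index}
    # Anything else has to be chased through other channels.
    unresolved = sorted(missing - set(to_migrate))
    return to_migrate, unresolved
```

This inverts the original plan: instead of pushing everything from the P-Drive and cleaning up afterwards, the Vault's gaps define what gets pulled.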

Results

Scope Reduction

Stage                       Document Count    Reduction (vs. previous stage)
─────                       ──────────────    ──────────────────────────────
Initial estimate            400,000+
After initial profiling     ~180,000          -55%
After duplicate removal     ~80,000           -56%
After in-scope filtering    ~32,000           -60%
Final decision              0 migrated        Migration halted
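The percentages in the table are reductions relative to the previous stage, while the headline 92% is cumulative against the initial estimate. A short calculation makes the relationship explicit. The counts are the approximate figures from the table, so the results are approximate as well.

```python
stages = [
    ("Initial estimate", 400_000),
    ("After initial profiling", 180_000),
    ("After duplicate removal", 80_000),
    ("After in-scope filtering", 32_000),
]


def reductions(stages):
    """Return (stage, count, step %, cumulative %) for each stage after the first."""
    baseline = stages[0][1]
    rows = []
    for (_, previous), (name, count) in zip(stages, stages[1:]):
        step = round(100 * (previous - count) / previous)        # vs. previous stage
        cumulative = round(100 * (baseline - count) / baseline)  # vs. initial estimate
        rows.append((name, count, step, cumulative))
    return rows


for name, count, step, cumulative in reductions(stages):
    print(f"{name}: {count:,} (-{step}% step, -{cumulative}% cumulative)")
```

The step reductions come out at 55%, 56%, and 60%, and the cumulative reduction after in-scope filtering is 92%.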

Business Impact

OutcomeValue
Risk avoidedRegulatory inspection findings prevented
Cost avoidedMassive post-migration remediation
Compliance maintainedNo SOP violations from migrating unverified content
Clarity deliveredTrue state of P-Drive data now understood

Lessons Learned

  1. Never scope a migration without profiling the data. This project was scoped based on folder structure, not content reality.

  2. File names and folder structures are not trustworthy. Documents are routinely misfiled, mislabeled, and duplicated.

  3. SME validation is necessary but not sufficient. Even human-validated samples contained errors that only surfaced in system testing.

  4. ML inherits source data problems. If training data is mislabeled, the model learns to mislabel.

  5. Sometimes the best migration is no migration. The courage to halt is harder than the momentum to proceed.

  6. Dry runs are essential, not optional. Sandbox validation caught what profiling and SME review missed.


Impact

By halting the migration, the client avoided:

  • Regulatory risk: Migrating mislabeled documents into a GxP-validated system would have created audit findings
  • Remediation cost: Post-migration cleanup of ~32,000 documents would have consumed the regulatory team for months
  • Operational chaos: Inspectors would have found documents in the wrong study folders during an audit

The profiling investment wasn’t wasted—it revealed the true state of the P-Drive and informed a safer path forward: complete parallel CRO migrations first, then pull only verified missing documents on demand.


Want to discuss document migration?

Facing a legacy migration with uncertain data quality? I can help you profile the risk before you commit—and know when to proceed versus when to stop. Get in touch.
