Executive Summary
A major telecommunications provider in the United States — supporting tens of millions of wireless, broadband, and enterprise subscribers — had accumulated one of the largest SAS estates in the industry. Spanning 4,200 SAS programs, 3.8 million lines of code, deep macro libraries, and a combined SAS Enterprise Guide and SAS Data Integration Studio deployment, the environment consumed over $6 million annually in SAS licensing and supporting infrastructure. Following a board-level decision to standardize on Databricks as the enterprise analytics platform, the organization selected MigryX to execute the full SAS-to-Databricks migration. Over 18 months, MigryX's AST-based parser resolved the entire nested macro call hierarchy, converted all PROC calls and DATA step logic to optimized PySpark, and deployed the migrated estate to Databricks with Unity Catalog governance. The program produced over 2 million lines of PySpark, performance improvements of 3–6X on subscriber analytics and network telemetry pipelines, and a projected $8.5 million in three-year savings.
Client Overview
The client is a facilities-based telecommunications carrier offering wireless, fiber broadband, and enterprise networking services nationwide. Their analytics function is critical to subscriber lifecycle management — from acquisition propensity modeling and network capacity planning to churn prediction, billing dispute resolution, and regulatory reporting to the FCC. The SAS estate had been built incrementally over 18 years, starting with a small team of biostatistics-trained analysts and growing through four major acquisitions that each added their own SAS macro libraries and PROC-based reporting frameworks. By the time of the migration assessment, the estate encompassed work from hundreds of analysts and engineers across marketing, finance, network operations, and regulatory affairs — many of whom were no longer with the organization.
The strategic rationale for migration was clear: SAS licensing costs had grown substantially through successive SAS Institute contract renewals, the SAS Enterprise Guide interface was increasingly at odds with the organization's data science talent pipeline (new hires expected Python and cloud-native tooling), and the organization's Databricks investment (already substantial for new-build ML and streaming analytics use cases) was significantly underutilized against its committed per-workspace compute spend.
Business Challenge
The SAS-to-Databricks migration presented a distinctive set of challenges compared to other ETL platform migrations, driven by SAS's unique combination of a proprietary programming language, a rich procedural library, and deeply nested macro capabilities:
- Macro-heavy codebase: Approximately 65% of the 4,200 programs invoked one or more SAS macros, and the macro library contained over 1,400 distinct macro definitions — including nested macros calling other macros up to seven levels deep. A naive text-substitution approach to macro resolution would have produced unmaintainable, bloated output code; correct migration required full Abstract Syntax Tree (AST) parsing with scope-aware macro resolution.
- PROC diversity: The estate used 47 distinct SAS PROCs, ranging from ubiquitous data management procedures (PROC SORT, PROC SQL, PROC TRANSPOSE) to statistical analytics PROCs (PROC LOGISTIC, PROC REG, PROC MIXED) and specialized reporting procedures (PROC REPORT, PROC TABULATE). Each PROC required a distinct mapping strategy to its PySpark or Databricks MLlib equivalent.
- DATA step complexity: SAS DATA steps implementing merge operations (especially the complex BY-variable merge with IN= dataset options), array processing, hash object lookups, and implicit output behavior represented the most semantically dense part of the codebase and the component most prone to subtle behavioral differences in naive translations.
- SAS DI Studio job flows: The DI Studio environment contained 640 job flows (graphical ETL pipelines) that needed to be converted to Databricks Workflows, including complex dependency chains with loop constructs and conditional execution paths that were not visible from the SAS program files alone.
- Implicit SAS dataset semantics: SAS DATA steps frequently rely on implicit dataset iteration, automatic variable retention across rows (the RETAIN statement), and SET-with-BY key-lookup patterns that have no syntactic equivalent in PySpark's functional DataFrame API and must be restructured using window functions or explicit join patterns (see the sketch after this list).
- Statistical PROC outputs: Reporting and statistical PROCs produce structured output datasets (ODS destinations) with specific column naming and format conventions that downstream programs depend on. Preserving these output schemas exactly was critical for programs that chained analytical PROCs in multi-step pipelines.
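To make the implicit-semantics point concrete, below is a minimal sketch (illustrative table and column names, not the client's code) of a RETAIN-style running calculation restructured as a PySpark window function:

```python
# Illustrative only: a SAS RETAIN running total rebuilt as a window function.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
usage = spark.createDataFrame(
    [("A001", 1, 120.0), ("A001", 2, 95.5), ("A002", 1, 310.0)],
    ["subscriber_id", "bill_month", "usage_gb"],
)

# SAS (conceptually):
#   data running; set usage; by subscriber_id;
#     if first.subscriber_id then cum_usage = 0;
#     retain cum_usage; cum_usage + usage_gb;
#   run;
w = (
    Window.partitionBy("subscriber_id")
    .orderBy("bill_month")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
running = usage.withColumn("cum_usage", F.sum("usage_gb").over(w))
```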
The MigryX Approach
MigryX deployed its SAS AST parser as the foundation of the conversion pipeline. Unlike regex-based or pattern-matching approaches, the AST parser fully tokenizes and parses each SAS program into a language-aware syntax tree that correctly understands scope rules, macro variable resolution order, conditional compilation (%IF/%THEN/%ELSE), and the interaction between macro and DATA step boundaries. This allows MigryX to resolve the full effective code of each program — with all macro calls expanded in context — before generating the target PySpark representation.
The macro resolution phase was particularly critical for this client. The 1,400-macro library included several deeply interdependent macro families developed for subscriber segmentation, network KPI calculation, and regulatory reporting. MigryX's resolver built a complete macro call graph, identified circular reference risks (which were flagged and resolved with client SMEs), and generated expanded representations that were then optimized to eliminate redundant intermediate datasets before PySpark generation. The resulting PySpark code for macro-heavy programs was, in most cases, more concise than the macro-expanded SAS representation.
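The call-graph analysis itself is easy to illustrate. The sketch below is not MigryX's resolver; it is a minimal depth-first traversal over a hypothetical macro-to-callee adjacency map that surfaces the kind of circular references flagged for SME review:

```python
from typing import Dict, List

def find_cycles(call_graph: Dict[str, List[str]]) -> List[List[str]]:
    """Return every cycle found in a macro -> callees adjacency map."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / in progress / finished
    color = {m: WHITE for m in call_graph}
    cycles: List[List[str]] = []
    stack: List[str] = []

    def dfs(macro: str) -> None:
        color[macro] = GRAY
        stack.append(macro)
        for callee in call_graph.get(macro, []):
            if color.get(callee, WHITE) == GRAY:       # back edge: a cycle
                cycles.append(stack[stack.index(callee):] + [callee])
            elif color.get(callee, WHITE) == WHITE:
                dfs(callee)
        stack.pop()
        color[macro] = BLACK

    for macro in call_graph:
        if color[macro] == WHITE:
            dfs(macro)
    return cycles

# Hypothetical macro names: %seg_base calls %kpi_calc, which calls %seg_base.
print(find_cycles({"seg_base": ["kpi_calc"], "kpi_calc": ["seg_base"]}))
# [['seg_base', 'kpi_calc', 'seg_base']]
```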
DATA step conversions employed MigryX's semantic transformation library, which maps SAS DATA step constructs to their structural PySpark equivalents. The MERGE statement with multiple IN= options was converted to join chains with explicit null-checking and row-source tagging. RETAIN-based running calculations were converted to PySpark window functions with appropriate frame specifications. Hash object lookups were converted to broadcast-joined reference DataFrames. Array processing with DO OVER loops was converted to PySpark explode/pivot patterns or vectorized column operations as appropriate for the array's cardinality and use pattern.
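As an illustration of the MERGE pattern (hypothetical tables; a production conversion must also handle many-to-many BY groups, whose SAS semantics differ from SQL join semantics):

```python
# SAS: data both; merge a(in=ina) b(in=inb); by id; if ina and inb; run;
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
a = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "plan"])
b = spark.createDataFrame([(2, 9.9), (3, 4.5)], ["id", "rate"])

# Tag each source so the IN= conditions survive the join.
merged = (
    a.withColumn("in_a", F.lit(True))
    .join(b.withColumn("in_b", F.lit(True)), on="id", how="full_outer")
    .withColumn("in_a", F.coalesce("in_a", F.lit(False)))
    .withColumn("in_b", F.coalesce("in_b", F.lit(False)))
)
both = merged.filter(F.col("in_a") & F.col("in_b"))   # if ina and inb;
```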
SAS DI Studio job flows were extracted via the DI Studio metadata export API, parsed to recover the job dependency graph, and converted to Databricks Workflow task definitions. The resulting Workflows maintained the original scheduling cadence, dependency semantics, and error handling behaviors while adding Databricks-native monitoring, alerting, and retry policies. Unity Catalog was configured as the governance layer for all migrated datasets, with column-level access controls aligned to the client's existing data classification taxonomy — a requirement driven by FCC CPNI (Customer Proprietary Network Information) regulations and the client's privacy program under state-level consumer protection laws.
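The converted Workflow definitions follow the standard Databricks Jobs API 2.1 shape. The fragment below is illustrative only; job, task, notebook, and notification names are hypothetical:

```python
# One DI Studio dependency edge becomes one depends_on entry.
job_spec = {
    "name": "subscriber_daily_load",
    "tasks": [
        {
            "task_key": "stage_usage",
            "notebook_task": {"notebook_path": "/Migrated/stage_usage"},
        },
        {
            "task_key": "churn_features",
            "depends_on": [{"task_key": "stage_usage"}],
            "notebook_task": {"notebook_path": "/Migrated/churn_features"},
            "max_retries": 2,  # Databricks-native retry policy
        },
    ],
    "schedule": {
        "quartz_cron_expression": "0 30 5 * * ?",  # keep the 05:30 daily cadence
        "timezone_id": "America/New_York",
    },
    "email_notifications": {"on_failure": ["data-eng-oncall@example.com"]},
}
```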
SAS PROC Mapping Reference
| SAS PROC | PySpark / Databricks Equivalent | Conversion Notes |
|---|---|---|
| PROC SORT | DataFrame.orderBy() | NODUPKEY option mapped to .dropDuplicates() on BY keys |
| PROC SQL | spark.sql() / DataFrame API | SAS SQL extensions (CALCULATED, OUTOBS) resolved to standard SQL |
| PROC TRANSPOSE | DataFrame.groupBy().pivot() | ID, VAR, and BY variable mappings preserved exactly |
| PROC MEANS / SUMMARY | groupBy().agg(mean, std, min, max) | CLASS and VAR variable structure mapped to groupBy keys and agg columns |
| PROC FREQ | groupBy().count() + cross-tab pivot | TABLES statement cross-tabulation converted to pivot with count aggregation |
| PROC LOGISTIC | Databricks MLlib LogisticRegression | Parameter options mapped; model outputs stored in Delta Lake |
| PROC REG | Databricks MLlib LinearRegression | MODEL statement variables mapped to feature/label columns |
| PROC MIXED | Custom PySpark + statsmodels UDF | Mixed model fitting requires Python statsmodels via Pandas UDF (sketched after this table) |
| PROC REPORT | Databricks SQL + Delta table output | DEFINE variable options mapped to SQL aggregations and column aliases |
| PROC TABULATE | groupBy().pivot().agg() chain | CLASS hierarchy and TABLE statement structure fully preserved |
| PROC IMPORT / EXPORT | spark.read / DataFrame.write | DBMS= options mapped to Spark format readers (Excel via a third-party Spark Excel connector) with equivalent options |
| DATA Step (MERGE) | PySpark join chain with IN= tracking columns | Complex BY-variable merges require semantic analysis |
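As one example of the non-trivial rows above, a hedged sketch of the PROC MIXED strategy: per-segment mixed-model fits with statsmodels inside a grouped Pandas UDF (table and column names are hypothetical):

```python
import pandas as pd
import statsmodels.formula.api as smf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def fit_mixed(pdf: pd.DataFrame) -> pd.DataFrame:
    # Random intercept per cell site; fixed effect of load on latency.
    result = smf.mixedlm(
        "latency_ms ~ load_pct", data=pdf, groups=pdf["cell_site"]
    ).fit()
    return pd.DataFrame({
        "region": [pdf["region"].iloc[0]],
        "intercept": [result.params["Intercept"]],
        "load_coef": [result.params["load_pct"]],
    })

telemetry = spark.table("network.telemetry")   # hypothetical source table
fits = telemetry.groupBy("region").applyInPandas(
    fit_mixed, schema="region string, intercept double, load_coef double"
)
```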
Key Migration Highlights
- MigryX's AST parser successfully resolved all 1,400 macro definitions across the seven-level nesting hierarchy with zero unresolved macro references, eliminating the single largest risk factor that had caused previous manual migration attempts to stall.
- 3,560 of 4,200 programs (85%) were migrated to production-ready PySpark through fully automated conversion; the remaining 640 received MigryX-assisted engineering review, with the most complex PROC MIXED and hash-object-heavy DATA steps prioritized for human review.
- Subscriber churn prediction pipeline execution time reduced from over 6 hours to approximately one hour on Databricks, enabling the marketing team to run daily model refreshes that were previously restricted to weekly cadence.
- Unity Catalog column-level access controls replaced a complex, manually maintained SAS library access control spreadsheet that had been the source of multiple CPNI compliance findings in prior FCC audits.
- All 640 SAS DI Studio job flows were converted to Databricks Workflows with complete dependency preservation; no job flow required manual reconstruction of the dependency graph.
- MigryX delivered a comprehensive PROC mapping validation report demonstrating output-equivalence across all 47 PROC types used in the estate, backed by automated test harnesses running against 12 months of historical input/output pairs.
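A minimal sketch of the kind of equivalence check such a harness performs (not MigryX's actual harness; paths, keys, and tolerance are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def assert_equivalent(sas_df, pyspark_df, keys, metric, tol=1e-9):
    """Fail if any row's metric differs beyond tol, or exists on one side only."""
    joined = sas_df.alias("s").join(pyspark_df.alias("p"), on=keys, how="full_outer")
    mismatches = joined.filter(
        F.col(f"s.{metric}").isNull()
        | F.col(f"p.{metric}").isNull()
        | (F.abs(F.col(f"s.{metric}") - F.col(f"p.{metric}")) > tol)
    )
    n = mismatches.count()
    assert n == 0, f"{n} rows differ on {metric} beyond tolerance {tol}"

# One month of the 12 months of historical input/output pairs.
sas_out = spark.read.parquet("/validation/sas/churn_2023_06")
new_out = spark.read.parquet("/validation/pyspark/churn_2023_06")
assert_equivalent(sas_out, new_out, keys=["subscriber_id"], metric="churn_score")
```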
Security & Compliance
As a regulated telecommunications carrier, the client operates under FCC CPNI rules, state-level Consumer Privacy Acts (including California CPRA and New York SHIELD Act), PCI DSS for billing systems, and SOX financial reporting controls. The migration program was scoped to maintain compliance posture throughout the transition, with no degradation of existing controls permitted.
- CPNI data handling: All SAS datasets classified as CPNI were tagged in the migration inventory and subjected to enhanced access control configuration in Databricks Unity Catalog, with row-level security policies limiting access to authorized subscriber service representatives (an illustrative policy sketch follows this list).
- PCI DSS scope reduction: Migration provided an opportunity to remove payment card data from SAS analytical datasets where it had historically been retained unnecessarily; the migrated pipelines were redesigned to use tokenized card reference identifiers throughout.
- SOX control preservation: Financial reporting programs (revenue recognition, subscriber count reporting) were subject to a formal control mapping exercise ensuring that all SOX ITGCs applicable to the source SAS programs were re-implemented in the Databricks environment with equivalent or stronger controls.
- In-environment migration tooling: All MigryX conversion tools ran within the client's own AWS environment; no source code or data was transmitted to external systems.
- Segregation of duties: Databricks Unity Catalog permission structures enforced production/development/test environment segregation that had been manual and inconsistently applied in the SAS environment.
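For illustration, the general shape of a Unity Catalog row filter and column mask of the kind described above (catalog, schema, table, and group names are hypothetical; spark is the session a Databricks notebook provides):

```python
# Row filter: hide CPNI-tagged rows from users outside the authorized group.
spark.sql("""
    CREATE OR REPLACE FUNCTION gov.policies.cpni_filter(cpni_flag BOOLEAN)
    RETURN (NOT cpni_flag) OR is_account_group_member('cpni_authorized')
""")
spark.sql("""
    ALTER TABLE gov.subscriber.profile
    SET ROW FILTER gov.policies.cpni_filter ON (cpni_flag)
""")

# Column mask: tokenized card references visible only to billing operations.
spark.sql("""
    CREATE OR REPLACE FUNCTION gov.policies.mask_card(card_token STRING)
    RETURN CASE WHEN is_account_group_member('billing_ops')
                THEN card_token ELSE '****' END
""")
spark.sql("""
    ALTER TABLE gov.billing.payments
    ALTER COLUMN card_token SET MASK gov.policies.mask_card
""")
```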
Results & Business Impact
The 18-month program concluded with all 4,200 SAS programs successfully migrated, validated, and operating in production on Databricks. The SAS Enterprise Guide environment was decommissioned three months ahead of the original target date, and the SAS DI Studio metadata server was shut down upon completion of the final job flow cutover. The program's timeline and methodology served as the blueprint for two subsequent migration programs covering acquired-company SAS estates.
The $8.5 million three-year savings projection is anchored by $5.8 million in eliminated SAS Institute licensing fees (Enterprise Guide, Base SAS, SAS/STAT, SAS/ETS, and DI Studio components), $1.7 million in reduced SAS Metadata Server and compute infrastructure operating costs, and $1.0 million in reduced SAS administration and developer support labor. These savings materially contribute to the organization's stated goal of reducing legacy analytics platform spend by 70% within five years to fund AI/ML capability buildout on Databricks.
"Our SAS estate was the result of 18 years of organic growth through acquisitions and analyst-driven development. We had programs that nobody fully understood, macro libraries that were treated as black boxes, and a licensing bill that went up every year regardless of what we negotiated. The AST-based approach gave us the first complete picture of what our SAS estate was doing, at a level of detail we'd never had before. The licensing savings in the first year substantially offset the program investment."
— Vice President of Data Engineering, Major US Telecommunications Provider
Ready to Modernize Your SAS Estate?
See how MigryX can accelerate your migration to Databricks with AST-driven automation. Full macro resolution. Comprehensive PROC coverage. Automated validation.
Explore Databricks Migration →