Blog: Migrating an On-Premises SAS System to Microsoft Azure HDInsight

A global retailer in the beauty industry contacted DB Best looking for the right modernization path for their SAS-based BI solution. They were hitting the limits of SAS with their large data sets and wanted to evaluate alternatives built on big data technologies such as Apache Hadoop with Azure HDInsight. In addition, they wanted to control their SAS license costs. In this blog post, we'll walk through how we helped our long-term customer migrate their SAS solution to a modern big data stack.

The process

Since our customer's leadership had decided to decommission their SAS solution, we suggested starting with a proof-of-concept project using Apache Hadoop technologies on Microsoft Azure HDInsight.

Our customer had specific requirements: the new system had to produce the same results and comparable performance to their existing SAS solution. The Azure HDInsight stack, with its parallel data processing, looked like a great fit.

Practical part

The customer provided us with 20 files of SAS code that had to run in sequence. These files contained ETL code of varying complexity. They were written by different people, worked with heterogeneous data, and mostly called simple analytical functions.

After the initial analysis, we split those SAS files into 13 files forming the Analytical Dashboard sequence and the remaining files containing the other code.

We converted the Analytical Dashboard files to Python code that, in the Zeppelin environment, generates Hive SQL code executed through the Zeppelin shell interpreter. The other SAS files we converted to Hive SQL without any trouble, as that SAS code used SAS Base statements with minimal SAS Macro statements.

Current SAS Architecture and proposed Hadoop architecture

Challenges

Every migration project has its challenges. In this case, we worked with dynamic scripts and considerable volumes of data. Because SAS and Hadoop process data in very different ways, we needed a way to abstract these processes to improve testability.
To make things interesting, our customer's SAS solution had stability issues. This meant we had to understand the intent of the application so that our Azure-based solution produced the correct results. Many of these issues came down to cleaning the data for consistent processing. Here are some of the specific challenges we encountered.

Processing NULL and missing values

The most common issue we encountered was the different handling of SAS missing values and Hive NULL values. For example, the following query in SAS returns all observations that are not equal to 0, including those with missing values, while in Hive SQL it returns only the rows that are not equal to 0 and not NULL.

SELECT
    column1,
    column2,
    column3
FROM
    table1
WHERE
    column1 <> 0

Thus, we had to modify the original query so that the SAS and Hive systems produced the same results.

SELECT
    column1,
    column2,
    column3
FROM
    table1
WHERE
    column1 <> 0
    OR column1 IS NULL
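The effect of SQL's three-valued logic is easy to reproduce locally. Here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Hive (the table and column names mirror the example above and are illustrative); both engines treat a comparison against NULL as unknown, so the row is silently dropped unless the NULL check is added explicitly:

```python
import sqlite3

# SQLite stands in for Hive here; both follow SQL three-valued logic,
# so a NULL in column1 makes "column1 <> 0" evaluate to unknown (not true).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (column1 INTEGER)")
conn.executemany("INSERT INTO table1 VALUES (?)", [(0,), (5,), (None,)])

# Plain inequality silently drops the NULL row.
strict = conn.execute("SELECT column1 FROM table1 WHERE column1 <> 0").fetchall()
print(strict)  # [(5,)]

# Adding the explicit NULL check restores SAS-like behavior.
fixed = conn.execute(
    "SELECT column1 FROM table1 WHERE column1 <> 0 OR column1 IS NULL"
).fetchall()
print(fixed)   # [(5,), (None,)]
```

The same rewrite had to be applied to every filter in the migrated code that could encounter missing values.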

Processing mechanics that can’t be directly converted to Hive SQL

SAS has a set of processing mechanics that cannot be directly converted to Hive SQL. In our experience, such cases involved merge operators, conditional processing via SAS Base variables for intermediate calculations, "calculated" columns in PROC SQL, and so on.

These cases required deep analysis and custom solutions.

For example, SAS PROC SQL syntax allows the following expressions with the CALCULATED keyword, while Hive SQL does not support them.

SELECT
    col1 + col2 AS total1,
    col3 + calculated total1 AS total2,
    col4 + calculated total1 AS total3
FROM
    table1

For Hive SQL we had to expand this query and repeat all of the calculations, as follows:

SELECT
    col1 + col2 AS total1,
    col3 + col1 + col2 AS total2,
    col4 + col1 + col2 AS total3
FROM
    table1
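Another option, where the inline repetition gets unwieldy, is to compute the intermediate column once in a subquery and reference it by name in the outer query; Hive SQL supports subqueries in the FROM clause. A minimal sketch, again using sqlite3 for local illustration (table and column names are the hypothetical ones from the example above):

```python
import sqlite3

# Compute total1 once in a subquery, then reuse it by name in the outer
# query. SQLite is used here for illustration; the same nesting is valid
# Hive SQL and avoids repeating the arithmetic in every derived column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (col1 INTEGER, col2 INTEGER, col3 INTEGER, col4 INTEGER)"
)
conn.execute("INSERT INTO table1 VALUES (1, 2, 10, 20)")

rows = conn.execute("""
    SELECT
        total1,
        col3 + total1 AS total2,
        col4 + total1 AS total3
    FROM (
        SELECT col1 + col2 AS total1, col3, col4 FROM table1
    ) t
""").fetchall()
print(rows)  # [(3, 13, 23)]
```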

Limited transpose in Hive SQL

The SAS system has powerful and flexible built-in functionality for transpose operations (that is, swapping data between rows and columns). Hive SQL has no dedicated functions or commands for transposing data, so any transpose operation implemented in Hive SQL alone is limited in functionality.

We had several calls to PROC TRANSPOSE to implement within the migration. To get a solution as close as possible to the existing SAS behavior, our specialists used Python code together with the Pandas library.

In essence, we wrote Python code that performs the transpose for Hive, leveraging Pandas to retrieve the additional metadata.
For example, to transpose Table 1 into Table 2 in SAS, we use simple and flexible code:

proc transpose data = table1 out = table2 name = dimension2;
    id dimension1;
run;

It is crucial that the SAS code keeps working even if we extend the table or add more rows; in those cases, the system automatically produces the corresponding number of rows and columns. Hive SQL does not provide this functionality, so to get the required results we would have to use the following solution.

An example of Hive SQL code

SELECT
    table2.dimension2,
    SUM(table2.elem1) AS elem1,
    SUM(table2.elem2) AS elem2,
    SUM(table2.elem3) AS elem3
FROM
     (
        SELECT
            'value1' AS dimension2,
            CASE
                WHEN dimension1 = 'elem1' THEN value1
                ELSE NULL
            END AS elem1,
            CASE
                WHEN dimension1 = 'elem2' THEN value1
                ELSE NULL
            END AS elem2,
            CASE
                WHEN dimension1 = 'elem3' THEN value1
                ELSE NULL
            END AS elem3
        FROM
            table1

        UNION ALL

        SELECT
            'value2' AS dimension2,
            CASE
                WHEN dimension1 = 'elem1' THEN value2
                ELSE NULL
            END AS elem1,
            CASE
                WHEN dimension1 = 'elem2' THEN value2
                ELSE NULL
            END AS elem2,
            CASE
                WHEN dimension1 = 'elem3' THEN value2
                ELSE NULL
            END AS elem3
        FROM
            table1
    ) table2
GROUP BY
    table2.dimension2

The Hive SQL code is not as flexible as the SAS code: with the suggested solution, if we extend Table 1, the code simply won't process the newly added rows and columns. In addition, the Hive SQL code produces errors if we remove columns from the table.

Instead, we came up with an alternative approach using Python and the Pandas library (the transpose() method of a Pandas DataFrame).

# hive_context is a pre-created pyspark.sql.HiveContext (as used on HDInsight)
hc_query = hive_context.sql("select dimension1, value1, value2 from table1")
# Pull the result to the driver as a Pandas DataFrame and transpose it
pandas_query = hc_query.toPandas()
pandas_query_transposed = pandas_query.transpose()
# Replace table2 with the transposed result
hive_context.sql("drop table if exists table2")
hc_query_target = hive_context.createDataFrame(pandas_query_transposed)
hc_query_target.registerTempTable("table2")
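The transpose step itself can be tried out locally with Pandas alone. Here is a minimal sketch with the cluster parts removed (the table and column names mirror the examples above and are illustrative):

```python
import pandas as pd

# A small local sketch of what PROC TRANSPOSE does, using only Pandas.
table1 = pd.DataFrame({
    "dimension1": ["elem1", "elem2", "elem3"],
    "value1": [1, 2, 3],
    "value2": [4, 5, 6],
})

# Make the id column the index, then swap rows and columns.
table2 = table1.set_index("dimension1").transpose()
table2.index.name = "dimension2"
print(table2)
```

Unlike the hand-written CASE/UNION ALL query, this version adapts automatically when rows or value columns are added to or removed from table1, which is exactly the flexibility the SAS code provided.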

Customer’s benefits

The DB Best team completed the migration and smoothly switched our customer's business users from SAS to the new system based on Hadoop and Hive with Azure HDInsight. We made the switch with minimal downtime, and our customer got one step closer to decommissioning their SAS licenses.

Considering the amount of data our customer owns, this migration required very meticulous work. Summarizing the effort, we can state two facts:

We can migrate any system

It's possible to migrate any system, even one that is unstable or underconfigured. In this particular case, we faced the difficulty of migrating a system whose configuration didn't allow for specific operations; however, that couldn't stop the process.

Equal performance

It's almost impossible to create a big data solution that works faster than a comparable system built on SAS. Oftentimes, when migrating from SAS, system performance drops by up to 10 times compared to the original SAS performance.

Our team managed to migrate our customer's system from SAS to Hive SQL and reach similar performance, with only a 1% difference. For a migration of this kind, such results are a great achievement. Thus, our customer doesn't lose productivity and still gets perfectly processed data and analytics without any of the inconveniences a migration can cause.

If you're interested in starting the modernization of your SAS solutions or other big data systems, feel free to contact DB Best. Our team of qualified experts will be happy to consult you on how to make your migration smooth and effective.
