Import Data Using SSIS and Transposing Columns to Rows

Problem

I have a requirement to import a text file, using SQL Server Integration Services (SSIS), that has multiple columns that need to be transposed to multiple rows. I can’t find any way to do this in SSIS. How can this be done?

Solution

This limitation of importing a varying number of columns can be overcome with a script task where we can use C# or VB.NET code to transpose columns to rows and store the valid data in the database, redirect the error rows to the database and fail the package if the file is not in the expected format. This script technique is quite useful when we want to load data into a non-SQL environment (Example: EDM platforms) as it avoids loading the data into SQL and makes use of T-SQL functions. Also, this script is flexible as it works irrespective of data type unlike some T-SQL functions.

Overview

There could be a scenario where a data provider has ‘N’ number of measures and may provide a data feed with dynamic columns depending on the data that was asked for and it is important to develop a solution which would accommodate this varying nature.

For example, a financial markets data feed could contain multiple measures and a data provider (i.e. Bloomberg, Reuters, FactSet, etc.) could supply only the information that we ask for instead of all the columns in every single feed.

Example Data Feed where number of facts could vary each time:

Company,Date,Datasource,Open Price,High Price,Low Price,Average Price 
Microsoft,2018-03-10,Traders,100.30,140.20,104.35,120.10 
Amazon,2018-03-10,Traders,120.75,140.23,130.80,130.50 

We need to build a solution to accommodate this varying nature of columns and the output that is needed is:

Company,Date,Datasource,Property Name,Property Value 
Microsoft,2018-03-10,Traders,Open Price,100.30 
Microsoft,2018-03-10,Traders,High Price,140.20 
Microsoft,2018-03-10,Traders,Low Price,104.35 
Microsoft,2018-03-10,Traders,Average Price,120.10
Amazon,2018-03-10,Traders,Open Price,120.75
Amazon,2018-03-10,Traders,High Price,140.23 
Amazon,2018-03-10,Traders,Low Price,130.80 
Amazon,2018-03-10,Traders,Average Price,130.50 

An example package has been developed to demonstrate this solution.

This package contains a script component under a data flow task which will act as a data source to read the input file and does the following steps:

  1. Check whether the header row has fixed columns coming through correctly. For example, you are expecting the first few columns to be fixed in the file and check whether those columns are coming through as the package may not behave correctly if it is not the case. The package needs to fail as the source file is not in the expected format.
  2. Check whether any rows are missing data by comparing the number of columns in each row with the header and redirect these rows to an error log.
  3. Transpose the varying number of columns to rows by reading the columns for each line in the file and storing the values in new rows. These are classified as valid rows and these output to the next step in the process.

Step 1 – Variable/Parameters Definition

We will create the following parameters for the package:

  • FileName – Fully qualified name for the input file.
  • LogFilePath – Path to store the error log file (with timestamp) where the incomplete rows would be redirected.

Step 2 – Configuring script component under a Data Flow

A data flow task was created and a script component was added by setting its type to Source and then the variable was configured in the script component. Please refer to the below picture for variable configuration.

The following outputs and columns have been added in the Inputs and Outputs section in the Script Transformation Editor. TransposedRows output will contain the transposed rows and ErrorRows will contain the error rows (for example, incomplete rows or to contain an error message that the file is not in the expected format).

Step 3 – C# script to process the file and output the rows

The script is then edited to process the file and create the output rows. The below code contains the comments to understand each bit of the code.

public override void CreateNewOutputRows()
{
    string[] columnvalues;
    //string variable to store line from the file when we loop through the lines
    string line=null; 
    int lineno = 0; //variable to keep track of line no whilst processing the file
    StreamReader fr = File.OpenText(Variables.FileName); //Open the File
    int headercolumnslength = 0; //variable to record header columns length
    string[] PropertyName=new string[1]; // Array to store dynamic column names

    if (fr != null)
    {
        line = fr.ReadLine(); //Read the first line of the file
        //Assuming the file is comma delimited, we split the column values
        columnvalues = line.Split(','); 

         //Calculate the no. of columns in the header row
        headercolumnslength = columnvalues.Length; 
        lineno++;

        /*Check whether there are any issues in the fixed columns in the file,
          Any issues here to be added to the errorrows output and fail the dataflow task  
         */

        if (columnvalues[0].Contains("Company") == false || columnvalues[1].Contains("Date") == false || columnvalues[2].Contains("Datasource") == false)
        {
            AddErrorRow("Error", 0, "Header is not in the expected format, Required fields   missing from file");
            bool pbCancel = false;
            this.ComponentMetaData.FireError(100, "Script Component", "Header not in excepted format", "", 0, out pbCancel);
            fr.Close();
            fr = null;
        }
        /*Loop through the file line by line
        1. Check whether the number of columns in each line matches with no. of columns in the header line. 
           Any mismatch to be redirected to ErrorRows output
        2. The example file contains 3 fixed columns and therefore the varying no. of columns 
           in the first row (header row) to be saved onto PropertyName column in the TransposedRows output.
        3. Varying no. of columns for each line to be saved onto PropertyValue column in the TransposedRows output

           */
        else
        {
            while (fr.EndOfStream == false)
            {
                if (lineno != 1)
                {
                    line = fr.ReadLine();
                    columnvalues = line.Split(',');
                }
                if (lineno++ >= 2) //proceed if the line is not header
                {
                   if (columnvalues.Length != headercolumnslength)
                    {
                        AddErrorRow("Error", lineno - 1, "Number of columns does not match header row for the Company " + columnvalues[0]);
                    }
                    else
                    {
                        for (int i = 3; i < columnvalues.Length; i++)
                        {
                            TransposedRowsBuffer.AddRow();
                            TransposedRowsBuffer.Company = columnvalues[0];
                            TransposedRowsBuffer.Date = columnvalues[1];
                            TransposedRowsBuffer.Datasource = columnvalues[2];
                            TransposedRowsBuffer.PropertyName = PropertyName[i - 3];
                            if (string.IsNullOrEmpty(columnvalues[i]) == false)
                            { TransposedRowsBuffer.PropertyValue = Convert.ToDecimal(columnvalues[i]); }
                        }
                    }
                }
                else
                {
                    PropertyName = new string[columnvalues.Length - 3];
                    for (int i = 3; i < columnvalues.Length; i++)
                    {
                        PropertyName[i - 3] = columnvalues[i];
                    }
                }
            }
        }
        if (fr != null)
        {
            fr.Close();
        }
    }

}
/*AddErrorRow Function to be called to add any error rows to the ErrorRows Output with appropriate message*/
private void AddErrorRow(string errorType, long lineno,string message)
{
    ErrorRowsBuffer.AddRow();
    ErrorRowsBuffer.Lineno = lineno;
    ErrorRowsBuffer.FileName = Variables.FileName;
    ErrorRowsBuffer.ErrorMessage = message;
    ErrorRowsBuffer.Timestamp = DateTime.Now.ToString("yyyy-MM-dd h:mm tt");
}
			

The script does the following:

  • Checks whether the file is in the expected format – In this example, we have 3 fixed columns and that’s why we assign the value 3 in the “for” loop. It reads the first line which is the header and finds out whether the expected fixed columns are coming through. If it is not coming through as expected then the dataflow task fails by the FireError method after adding the message in the ErrorRows Output.
  • Check whether the rows are incomplete – It reads each line in the files and compares it to the number of columns in the header to identify whether there are any ragged rows and if anything found needs to be marked in the ErrorRows Output, so that those lines could be redirected to an error log.
  • Transpose columns to rows
    • The header row is taken first and puts the varying column values into the PropertyName array.
    • Then the other lines in the file are read one by one to extract the actual values and stored in the PropertyValue array along with the respective PropertyName and these rows are then stored as TransposedRows output.
  • AddError is a function which can be called to store any error message in the ErrorRows output.

Step 4 – Redirecting the outputs

The outputs need to be redirected accordingly. The below screenshot shows that the “TransposedRows” output which contain the valid rows (after transposing from columns to rows) is being redirected to a SQL destination and “ErrorRows” output gets redirected to an error log which is a flat file destination.

Execution Results

Let’s take an example file with 3 rows (2 valid rows and 1 ragged row) and we would expect the 2 valid rows get transposed and redirected to the database and the 1 bad row gets redirected to the error log.

In the below example, the company Amazon is incomplete (ragged) and should be redirected to the error log.

Company,Date,Datasource,Open Price,High Price,Low Price,Average Price 
Microsoft,2018-03-10,Traders,100.30,140.20,104.35,120.10 
Alphabet,2018-03-10,Traders,70.80,90.60,72.20,80.10 
Amazon,2018-03-10,Traders,120.75,140.23 

Results stored in Database

We can see that for Microsoft and Alphabet we have 4 rows for each company.

Results redirected to Error Log

We can see below that the row for Amazon was a problem and was written to the error log file.

Similarly, if the header row does not have the right columns that you were expecting the package would fail after adding the message in the ErrorRows output and redirecting it to the log file.

Next Steps

Last Update:
2018-04-20

About the author

Balamurugan Loganathan is a UK based Data Warehousing and Business Intelligence professional with 13 years of experience.

View all my tips

Bir Cevap Yazın

E-posta hesabınız yayımlanmayacak. Gerekli alanlar * ile işaretlenmişlerdir

TOP