Many organizations use a single sign-on (SSO) provider to improve security and manage user access. SSO often includes two-factor authentication, which requires a user to enter a code from a secondary device to confirm that they are the person attempting to log in to a service. While this may feel like a minor chore for an everyday user, it becomes a sturdy obstacle for data engineers who need to connect to data behind that SSO wall in their extract, transform, and load (ETL) scripts. However, there is still a way to securely access data in Snowflake when the account uses SSO for login.
Let's break down the steps needed to pull the data. First, install and import the necessary libraries:
# Import OS to extract environment variables
import os
# For best security practices, use environment variables to secure sensitive information like passwords
from dotenv import load_dotenv
# Load in environment variables
load_dotenv(dotenv_path="development_environment_variables.env")
# Import pandas for converting the final dataset to a pandas dataframe
import pandas as pd
# Use logging to generate logs and identify errors / issues when not running from an IDE or console
import logging
# Use polars for quickly loading and manipulating large datasets. If the data is relatively small, this can be replaced with pandas
import polars as pl
# The Snowflake SQLAlchemy connector allows for connecting to Snowflake through SQLAlchemy
from snowflake.sqlalchemy import URL
# Import SQLAlchemy to create the engine used to connect to the database
from sqlalchemy import create_engine
# Configure logging to write to a text file
logging.basicConfig(filename=r'Power BI Logging\example.log', level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
By importing these libraries, you can follow best security practices and avoid keeping plain-text passwords or database information in your script. While not essential for the program to run, log files can help diagnose problems that arise later, such as data failing to load, errors, or slowdowns. Once all of the necessary libraries are added, create a new file to store your SQL query. If you are only running one query in your script, you can choose to embed the query in the script as a string. However, if the query will be modified in the future, if you are going to run multiple queries, or if the query is long and complex, it is best to keep the query text in a separate .txt or .sql file. Saving the file with a .sql extension allows your IDE to use IntelliSense to help write the query and ensure proper formatting, but a text file will be just as functional.
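As an optional sanity check, and as a sketch of how the environment variables and logging can work together, you can confirm that the expected variables actually loaded before attempting to connect. The variable names below match the ones used later in this script; the check itself is an assumption about how you might structure it, not a required step:
# Optional check: confirm the expected environment variables were loaded from the .env file
required_vars = ['snowflake_account', 'snowflake_user', 'snowflake_password',
                 'snowflake_warehouse', 'snowflake_database', 'snowflake_schema', 'snowflake_role']
missing_vars = [name for name in required_vars if os.getenv(name) is None]
if missing_vars:
    # Log the missing variables so the failure is easy to diagnose, then stop the script
    logging.error(f"Missing environment variables: {', '.join(missing_vars)}")
    raise SystemExit("Environment variables are missing; check the .env file")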
Next you can begin pulling the data itself:
# Load your SQL Query Text
with open("Query.SQL") as file:
    sql_query = file.read()
# Create a connection parameter dictionary to pass credentials and the authentication method into the engine
connection_params = dict(
    account=os.getenv('snowflake_account'),
    user=os.getenv('snowflake_user'),
    password=os.getenv('snowflake_password'),
    warehouse=os.getenv('snowflake_warehouse'),
    database=os.getenv('snowflake_database'),
    schema=os.getenv('snowflake_schema'),
    role=os.getenv('snowflake_role'),
    authenticator='externalbrowser'
)
# Create SQLAlchemy Engine using connection parameters
engine = create_engine(URL(**connection_params))
# Create a connection to Snowflake using the engine
con = engine.connect()
# Execute the SQL query and pass the data to a polars dataframe
raw_data = pl.read_database(sql_query, connection=con, infer_schema_length=1000)
# Convert to a pandas dataframe
power_bi_data = raw_data.to_pandas()
One of the most essential components of the above script is specifying "externalbrowser" as the authenticator. If you are unable to get an authentication token from the team that administers single sign-on or Snowflake, you will have to use your organization's preferred browser to log in through single sign-on, which will also initiate the two-factor authentication process. While this means data has to be loaded during normal working hours, when someone is available to complete the browser login, rather than during off-peak hours as is traditionally done to minimize cost, it does provide a way to load the data and automate most of the ETL process in a way that abides by standard security practices.
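To round out the script, you can log the outcome, release the Snowflake connection, and write the result somewhere Power BI can pick it up. The following is a minimal sketch under the assumption that the objects above are already defined; the output path is a placeholder, and exporting to CSV is just one way to hand the data off:
# Hypothetical wrap-up: log the outcome, export the data, and clean up the connection
try:
    logging.info(f"Query returned {len(power_bi_data)} rows")
    # Write the pandas dataframe to a CSV file for Power BI to read (placeholder path)
    power_bi_data.to_csv(r'Power BI Data\snowflake_extract.csv', index=False)
except Exception as error:
    # Record the failure in the log file so it can be investigated later
    logging.error(f"Failed to export data: {error}")
finally:
    # Release the Snowflake connection and dispose of the engine
    con.close()
    engine.dispose()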