24 April 2023
It is not easy to set up a reliable and secure authentication in an environment where a lot of data can be valuable and critical, like a data warehouse, a data mesh, or a data lake. SSO is one of the best practices for authentication processes especially when it comes to sensitive or critical applications.
A little reminder on SSO
SSO is a user authentication service and a session that allows this user to log in with the same username and password to multiple applications. Indeed, you delegate the authentication to a third-party application that will authorize you, or not, to access some applications. It permits you to centralize and manage all of your users’ authentication in a single point.
Guillaume wrote a very useful case on using SSH if you have a Bastion which includes using SSO. There are different protocols of authentication based on the purpose of the service or user that wants to connect, the 3 principles are those :
- SAML 2.0: it’s a standard for authorization and authentication
- OAuth 2.0: it is a standard for authorization, mainly used for API authorization, especially when an API call comes in between two applications
- OpenID Connect: it is a standard for authentication, this is used if someone wants to log in to an application through the credentials of another that permits it
💡 SAML and OAuth/OpenID are two different types of protocols, they can be both used in an SSO authentication as well as only them.
With all of this, you can also reinforce your authentication through other processes like MFA (Multi-Factor Authentication). With MFA, you are required to provide at least two verification factors to gain access to the resource needed.
Data warehouse and Snowflake
Why a data warehouse in the cloud?
To store your data, there are many ways to do it, you can decentralize it with databases spread across your project, or you can use a data lake, a data mesh, or a data warehouse. A data warehouse gathers structured data from various sources and serves as a repository for the entire company. The purpose of this warehouse is to store data that have been ordered, combined, and consolidated.
Hosting your data warehouse in the cloud can have several advantages. First of all, by reducing cost through elasticity, the system can scale instantly and can be flexible. With computing and storage separate, you only need to buy what's essential.
You can also deploy your solution faster as well as build your own data warehouse in just minutes, depending on the size of the course.
You can optimize your storage: since a data warehouse already has its data reduced because it is sorted, it takes up less space in the cloud and is, therefore, more cost-effective.
You can store data in self-service, indeed data from data warehouses are more available and accessible since it is sorted. This allows IT and DBA teams to focus their attention and resources on more strategic aspects of the business.
Little benchmark of existing data warehouse in the Cloud
Here is a quick benchmark on what you can expect to find on the market if you are trying to build your data warehouse in the cloud:
- Server Management:
- Serverless: The solution manages the servers
- Self-Managed: The customer manages the servers
- Pricing: This is the billing mode used, in flat rate mode, the resources are exclusively allocated to the customer for a defined period, unlike on-demand
- Separate storage and computation: This offers the possibility to upgrade the storage and computation servers independently
- Elastic Resizing: This allows you to only use the necessary resources, to decrease or increase them when needed. Therefore, this allows you to reduce your cost
- Pause Cluster: Allows you to pause unused clusters versus shutting them down, saving time and money
Why you might want to use Snowflake
In my opinion, Snowflake stands out from its competitors in two aspects
- Centralized data sharing: Snowflake allows you to share databases live with customers and business partners, connect with consumers and data providers, and share your data with another Snowflake account (or not) in real time at no extra cost.
- Snowflake offers to localize its data on the 3 main Cloud Providers (AWS, GCP, Azure), allowing to leverage the best of each. The multi-cloud approach can be useful in multiple ways, you can customize your cloud computing capabilities by combining clouds that provide different services. Appropriate combining IaaS options can increase your efficiency. Multi-cloud storage allows you to spread your data across multiple cloud providers which improves your resiliency. It can also prevent you from shadow computing.
Advantages and disadvantages of the solution
The solution we are talking about
You are using Snowflake with the SSO authentication and you have an Identity Provider
First of all, you should not underestimate your authentication process within your company. Authentication is a serious subject, and without a strong and secure one, you expose your company or your project to great risks. This is even more true when it comes to structures that store sensitive data.
SSO is once again a reliable and secure method to automate this process. Not only you can centralize this process, which will make it less tedious, but you will also be able to manage your users more easily via your identity provider and reduce attack exposure.
To implement this kind of solution, you usually need knowledge of all three, the data warehouse, the ID provider, and the SSO protocol you want to set up.
If one of your SSO users’ accounts has been compromised, then the hacker will have access to every application that you have.
As said earlier, the fact that Snowflake handles data sharing and offers to localize your data warehouse in 3 cloud providers makes it hard to configure. Snowflake provides a lot of documentation on how to implement your solution according to the applications and tools you might want to use. I highly encourage you to take a look at that.
How to implement my solution
If you want to develop an application, you are going to need to implement connectors, according to your application. If you only want to access it through your users, you can take a look at the Authentication page.
What if you don’t want to use SSO?
If you really don’t want to use SSO, you can use a classic username/password authentication, with credentials stored in a database.
Classic authentication is less user-friendly, users will have to remember more credentials and will have to interact more with different applications.
Snowflake also handles Key Pair Authentication & Key Pair Rotation and OAuth (without SSO). OAuth is a reliable tool to make your authenticate process reliable as well as a Key Pair Authentication but once again I highly encourage you to use SSO if possible.
What tools are for my data scientists?
If you plan to exploit the data you are storing in your data warehouse, which might be the case, you might wonder about the tools that your data scientist can use. Unfortunately, Snowflake has some lake of integration with some tools. They developed connectors based on what tools you want to use. But those connectors defer from one to another.
For example, if you want to use Spark, you are gonna need to use the Spark Connector from Snowflake, this one doesn’t handle web-based authentication which might be a problem. Indeed, you will need to provide a token for each authentication that you will provide in the connector. To hide this token from being stolen, you will have to implement a process that does it which can be painful and not always well done.
Snowpark is a library developed by Snowflake that provides an API for querying and processing data. A development experience that allows you to write code in your preferred language (Scala, Java, Python) and run it directly in Snowflake. It is a good alternative to the other tools above, you might encounter fewer compatibility problems and it should be easier to implement both your authentication processes and your data scientist tool.
SSO is a must-have in authentication processes, coupled with databases containing sensitive data, it becomes almost mandatory. Snowflake is a growing solution in the data warehouse Cloud environment and it is interesting to understand that the company responds well to the problem but lacks in my opinion of compatibility with some tools.