Ahoy!

Here is your weekly set of articles and my observations about data platforms, primarily focused on the Azure cloud.

Important: Mastermind #2 session about DataOps next Thursday (May 13 18:00 CET). Sign up here or reply directly to me.


Summary in 60 seconds

Azure Synapse

I know many data professionals who are still patiently waiting for Microsoft to deliver on its Azure Synapse promises.

In my view, Synapse might become a brilliant all-in-one service, but it's not there yet. If you are about to modernize your data stack now, you might take a look at the Lakehouse architecture.

 

Rise of the Lakehouse

The Lakehouse term, initially coined by Databricks, seems to be growing in popularity. Vendors like Snowflake, Microsoft, Informatica, Dremio, Fivetran, and many others market themselves as a crucial component behind the Lakehouse.

However, the true heroes behind the Lakehouse are open source file formats (Delta Lake, Apache Iceberg, Apache Hudi). They make it possible to merge data lakes and data warehouses into a single system.

 

Scaling Data: Data Informed to Data Driven to Data Led

An article that I really enjoyed, and I saw that many peers liked it too. Sharing it here in case you missed it.

"The right balance in data projects is achieved with a thoughtful sequencing of architectural engineering work, analytics, and application of analytics to business and product."

Though the article is targeted at start-up founders, there are a lot of things that I can relate to while working at a large insurance company.

Read more below.


Azure Synapse - will it live up to all the expectations? 

Microsoft is working hard behind the scenes to create a service able to compete with Snowflake, Databricks, Google BigQuery, and AWS Redshift.

 

Azure Synapse SQL (dedicated and serverless)

I know many customers successfully using dedicated Synapse SQL pools (a.k.a. Azure SQL Data Warehouse). It's a mature MPP offering, though a bit old-fashioned, with coupled storage and compute.

The serverless SQL pool is a rather new query service with a pay-per-query pricing model.

I can't wait to see both upgraded with the Polaris engine, which should decouple storage and compute (similar to a Snowflake warehouse).

Hopefully it pans out well, though it's not clear when ...

 

Azure Synapse Spark Pools

You will no longer need Databricks if all you require is a managed Spark environment and notebooks.

Yet, I find Spark Pools rather limiting, and I am not ready to fully jump in. Databricks still seems to be a superior service (and might remain unmatched).

Check out this great overview of Synapse Spark Pools by Simon Whiteley.

 

Azure Synapse Pipelines

It is a subset of Azure Data Factory. Over time, I expect both services to merge to offer one orchestration and transformation capability via Synapse Pipelines.

 

Other features

Power BI, Purview, source control, and Common Data Model integrations look very interesting. But the majority of my peers, myself included, are waiting for the features above to form a proper foundation.

 

Recommendation

  1. Synapse might become a brilliant all-in-one service, but it's not there yet. If you are about to modernize now, look at the Lakehouse architecture.
  2. Check out a recent article by Paul Andrew and his thoughts about Synapse Analytics.
  3. Read how Synapse and Databricks can coexist.
  4. Read more about Synapse in the Microsoft documentation.


Question for you:

What do you think about Synapse vs. Snowflake / Databricks / BigQuery / Redshift?


Lakehouse

I keep coming back to this Lakehouse paper from CIDR over and over again.

"The first key idea we propose for implementing a Lakehouse is to have the system store data in a low-cost object store (e.g., Amazon S3) using a standard file format such as Apache Parquet, but implement a transactional metadata layer on top of the object store that defines which objects are part of a table version"

"This allows the system to implement management features such as ACID transactions or versioning within the metadata layer, while keeping the bulk of the data in the low-cost object store and allowing clients to directly read objects from this store using a standard file format"
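To make the idea in these two quotes concrete for myself, I find a toy model helpful. The sketch below is only an illustration under my own assumptions: a Python dict stands in for the object store, a list stands in for the transactional metadata layer, and the class and method names (ToyLakehouseTable, commit, snapshot, read) are hypothetical. It is not how Delta Lake, Iceberg, or Hudi are actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLakehouseTable:
    """Toy model: immutable data objects plus an append-only commit log."""

    # Stand-in for a low-cost object store: object name -> rows.
    objects: dict = field(default_factory=dict)
    # Stand-in for the metadata layer: one entry per table version.
    log: list = field(default_factory=list)

    def commit(self, add=None, remove=()):
        """Write new objects, then publish them with a single log append."""
        add = add or {}
        live = self.snapshot()
        missing = [name for name in remove if name not in live]
        if missing:
            raise ValueError(f"cannot remove objects not in table: {missing}")
        clash = [name for name in add if name in self.objects]
        if clash:
            raise ValueError(f"objects are immutable, already written: {clash}")
        # Data objects are written first; readers cannot see them yet.
        self.objects.update(add)
        # The log append is the only step that changes what readers see.
        self.log.append({"add": sorted(add), "remove": sorted(remove)})
        return len(self.log)  # the new version number

    def snapshot(self, version=None):
        """Replay the log to find which objects belong to a table version."""
        if version is None:
            version = len(self.log)
        live = set()
        for entry in self.log[:version]:
            live |= set(entry["add"])
            live -= set(entry["remove"])
        return live

    def read(self, version=None):
        """Read rows directly from the objects listed in a snapshot."""
        rows = []
        for name in sorted(self.snapshot(version)):
            rows.extend(self.objects[name])
        return rows


table = ToyLakehouseTable()
v1 = table.commit(add={"part-0.parquet": [1, 2], "part-1.parquet": [3]})
# Rewrite the table: one new object replaces the two old ones in one commit.
v2 = table.commit(
    add={"part-2.parquet": [2, 3]},
    remove=["part-0.parquet", "part-1.parquet"],
)
current_rows = table.read()             # rows of the latest version
old_rows = table.read(version=v1)       # versioning: the earlier version is still readable
```

The point of the design is that the old objects are never modified or deleted by a commit, so an earlier version stays readable, and a writer that fails before the log append leaves the table unchanged.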

Wondering how to use the Lakehouse with Synapse? Stay tuned, as support for Delta in Synapse Pools is in preview.

Also, check out this recently published interview about the lakehouse.


Latest Azure and Databricks articles


Scaling Data: Data Informed to Data Driven to Data Led

Data Is Not A Team To Hire or Set of Tools To Implement

Too many organizations think they just need to adopt a new technology or grow the data team to fix their data needs. So often the sources of data problems lie outside of data team boundaries.

 

Strategy → Stage → Team → Tools

Strategy - What are your points of leverage? How does data improve those points of leverage?

Stage - What stage of maturity is our product in? What stage of maturity is our Data in?

Team - What people do we need to achieve the data strategy? Are they set up for success internally?

Tools - What tools do we need to adopt to facilitate the team's impact?

 

The 3 Stages of Data Maturity

Stage 1: Data Informed - The key business need is for data to provide operational visibility.

Stage 2: Data Driven - The key business need is for data to support the organization’s growth with scalable tooling, data products, and deep-dive insights.

Stage 3: Data Led - The key business need is the “productization” of data services that unlock Product and Data Science teams, allowing them to automate operational decision-making and user product experiences.

 

You should read the full article here!


Valdas Maksimavičius

IT Architect & Microsoft Data Platform MVP

https://www.dataplatformschool.com 

Vilnius
Lithuania
