Navigating Data Modelling in Cloud Data Warehouses: Insights from Vivekkumar Muthukrishnan

Data modelling within cloud data warehouses is pivotal for the development and optimisation of analytical solutions. Effective data modelling facilitates the organisation of data structures, expedites queries and aggregations, and enhances overall data processing performance. Vivekkumar Muthukrishnan, a seasoned expert in Data Engineering, sheds light on this crucial topic.

Vivekkumar begins by highlighting the importance of platform selection, particularly the features and considerations associated with Snowflake and Redshift. He notes Snowflake’s unique architecture, which separates compute from storage; this allows users to scale each resource independently as needed, providing flexibility and high performance for even the most complex queries. Redshift, on the other hand, has traditionally used a shared-nothing architecture in which each node provides both compute and storage. While this approach has its advantages, it can limit scalability and introduce latency for complex queries.

When choosing between Snowflake and Redshift, consider not only their core features but also the specific needs and goals of your business; each platform’s capabilities can have a significant impact on your company’s data processing efficiency and analytics. Factors to weigh include the type of workload and the budget, as well as security requirements and the level of technical infrastructure support available.

Vivekkumar notes that, depending on the specifics of the company’s business, it is also important to assess the extensibility of the chosen platform, its flexibility, and its ability to adapt to changing business needs. Licensing and legal requirements should also be considered to avoid problems later.

Moving on to data modelling principles, Vivekkumar outlines five key aspects essential for efficient data handling. The first is choosing the right schema. Cloud data warehouses often use star or snowflake schema semantics: a central fact table holds the metrics, along with foreign-key columns that link each row to the surrounding dimension tables. Choosing the right schema can significantly improve query performance and simplify data analysis, as the sketch below illustrates.
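To make this concrete, here is a minimal sketch of the kind of query such a schema supports, assuming hypothetical fact and dimension tables (fact_sales, dim_product, dim_date) and standard SQL:

  -- Join the fact table to two dimensions through their foreign keys
  -- and aggregate a metric by dimension attributes.
  SELECT d.calendar_month,
         p.product_category,
         SUM(f.sales_amount) AS total_sales
  FROM fact_sales f
  JOIN dim_product p ON f.product_key = p.product_key
  JOIN dim_date   d ON f.date_key    = d.date_key
  GROUP BY d.calendar_month, p.product_category;

Because every filter and grouping attribute lives in a small dimension table, the warehouse only needs to scan the relevant columns of the large fact table.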

The second important principle is the separation of data into physical and logical layers. The physical layer reflects how the data is organised on disk, while the logical layer represents its logical structure and the relationships within it. Proper layering lets you manage access to data effectively and optimise queries.

Third is the use of optimised data types. Cloud data warehouses usually offer different data types for different purposes. For example, in Snowflake you can use the VARIANT type to store semi-structured, polymorphic data and the ARRAY type to work with arrays of values. Choosing the right data types helps to reduce the amount of data stored and improve query performance.
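As a brief illustrative sketch (the names raw_events, payload and tags are hypothetical), Snowflake’s semi-structured types can be declared and queried like this:

  -- VARIANT holds arbitrary JSON documents; ARRAY holds ordered lists.
  CREATE TABLE raw_events (
      event_id NUMBER,
      payload  VARIANT,
      tags     ARRAY
  );

  -- Colon/dot path notation pulls fields out of the VARIANT column,
  -- and LATERAL FLATTEN expands the ARRAY into one row per element.
  SELECT e.payload:user.country::STRING AS country,
         t.value::STRING                AS tag
  FROM raw_events e,
       LATERAL FLATTEN(input => e.tags) t;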

The fourth principle is the use of data partitioning and clustering. These techniques divide data into logical and physical blocks, distribute it optimally across cluster nodes, and make queries more efficient. In Redshift, for example, distribution keys and sort keys determine how data is spread across nodes and ordered on disk.
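In Snowflake, clustering is declared rather than managed manually. A minimal sketch, assuming a hypothetical fact_sales table that is usually filtered by date and region:

  -- Ask Snowflake to co-locate rows in micro-partitions by these columns,
  -- so queries filtering on them can prune most of the table.
  ALTER TABLE fact_sales CLUSTER BY (date_key, region);

A concrete Redshift counterpart, using distribution and sort keys, appears in the Redshift example later in this article.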

Fifth is the use of indexes and their modern analogues. Indexes speed up data retrieval and filtering, improving query performance. Classic B-tree indexes are largely absent from columnar cloud warehouses, but the platforms offer comparable accelerators depending on the specific task at hand: Snowflake provides a search optimization service for selective point lookups, while Redshift relies on sort keys and the zone maps it maintains over sorted data.
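For instance, a hedged one-line sketch of enabling Snowflake’s search optimization service on a hypothetical table:

  -- Build persistent search access paths for selective point lookups.
  ALTER TABLE fact_sales ADD SEARCH OPTIMIZATION;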

In detailing the steps to create an effective data model, Vivekkumar highlights the basics.

1) Define your goals and requirements:

The first step when modelling data in cloud data warehouses such as Snowflake and Redshift is to define your goals and requirements. Clearly define what information you want to get out of the data, what reports and analytical queries you plan to run, and how the data will be used in your organisation.

2) Design with scalability in mind:

When modelling data in cloud data warehouses, it is important to consider scalability. Make sure your data model is flexible and capable of handling large amounts of data, and consider using data segmentation or partitioning to improve query performance.

3) Maintain normalisation:

Data normalisation is an important aspect of modelling data in cloud data warehouses. Try to minimise data redundancy and repeated values to reduce the size of the data warehouse and improve query performance.

4) Use a hierarchical structure:

It is recommended to use a hierarchical structure when designing a data model. Identify the main entities in the data and establish relationships between them. This presents the data in a logical, readable format and makes complex queries easier to write.

5) Develop aggregated tables:

To improve query performance, it is recommended to create aggregated tables. These tables contain pre-calculated, summarised data, which speeds up the execution of queries, especially when dealing with large amounts of data (see the aggregation sketch after this list).

6) Keep data up-to-date:

In cloud data warehousing, it is important to keep data current. Refresh it regularly and eliminate duplicates so that the model always reflects the latest state (see the de-duplication sketch after this list).

7) Test and optimise performance:

Don’t forget to test and optimise the performance of your data model. Run data-validation and query-performance tests to identify possible improvements.
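To illustrate step 5, a pre-aggregated table can be materialised from a detail-level fact table with a plain CREATE TABLE AS statement; the names daily_sales and fact_sales are hypothetical:

  -- Pre-compute daily totals so dashboards avoid scanning raw fact rows.
  CREATE TABLE daily_sales AS
  SELECT date_key,
         product_key,
         SUM(sales_amount) AS total_amount,
         COUNT(*)          AS order_count
  FROM fact_sales
  GROUP BY date_key, product_key;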
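And for step 6, one common de-duplication pattern keeps only the most recent row per business key; a sketch with hypothetical names (events, event_id, load_time):

  -- Rank duplicate rows per event_id by recency and keep the newest one.
  CREATE TABLE events_clean AS
  SELECT event_id, user_id, page_url, load_time
  FROM (
      SELECT e.*,
             ROW_NUMBER() OVER (
                 PARTITION BY event_id
                 ORDER BY load_time DESC
             ) AS rn
      FROM events e
  ) ranked
  WHERE rn = 1;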

Vivekkumar also gives a successful example of data modelling in Snowflake: a star schema for a company’s sales analytics. In this model, the data converges on a central fact table that holds foreign keys to dimensions such as products, customers and time periods. The dimensions are linked to the fact table through these foreign keys, which makes it easy to analyse the data and build reports on top of the schema.
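A hedged sketch of what such a star schema might look like in Snowflake DDL follows; all table and column names are hypothetical, and note that Snowflake records primary and foreign key constraints as metadata without enforcing them:

  -- Dimension tables hold descriptive attributes.
  CREATE TABLE dim_product (
      product_key  NUMBER PRIMARY KEY,
      product_name STRING,
      category     STRING
  );

  CREATE TABLE dim_customer (
      customer_key  NUMBER PRIMARY KEY,
      customer_name STRING,
      region        STRING
  );

  -- The central fact table references each dimension by foreign key.
  CREATE TABLE fact_sales (
      sale_id      NUMBER,
      product_key  NUMBER REFERENCES dim_product (product_key),
      customer_key NUMBER REFERENCES dim_customer (customer_key),
      date_key     NUMBER,
      sales_amount NUMBER(12,2)
  );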

Vivekkumar’s Redshift example involves storing website visit logs using a fact-dimension schema, where each record in the fact table represents an event, such as a visit to a particular page or a user taking a particular action. The adjacent dimension tables hold supporting data such as information about the user, the time period and the event itself. This model enables efficient analysis of visit data and tracking of user behaviour. In both examples, the key principles are normalisation and the use of foreign keys to link tables: normalisation avoids repeated data and enables more efficient storage and processing of information.
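A comparable hedged sketch of the Redshift visit-log fact table, with hypothetical names, shows where distribution and sort keys enter the design:

  -- DISTKEY co-locates a user's events on one node to speed joins to the
  -- user dimension; SORTKEY orders rows by time so range scans prune well.
  CREATE TABLE fact_page_views (
      event_id   BIGINT,
      user_key   INTEGER,
      date_key   INTEGER,
      page_url   VARCHAR(1024),
      event_type VARCHAR(64),
      event_time TIMESTAMP
  )
  DISTKEY (user_key)
  SORTKEY (event_time);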

The use of foreign keys makes it straightforward to link data across tables and supports analytical queries and data aggregation. It is important to note that successful data modelling in cloud data warehouses also requires platform-specific considerations: Redshift has its own performance optimisations that must be factored into the design, while Snowflake offers features such as automatic scaling and multi-cluster support for concurrent users. These examples of successful data modelling in Snowflake and Redshift, according to Vivekkumar, can serve as a starting point for building your own cloud data warehouse models.

In conclusion, Vivekkumar offers insightful recommendations, advocating proper planning, platform alignment, and regular statistics updates to streamline data modelling efforts. His expertise serves as a valuable guide to the intricacies of data modelling in cloud data warehouses, empowering organisations to harness the full potential of their data analytics capabilities.
