Finding the right path for data engineering
We spent considerable time understanding Azure's data engineering capabilities and building a robust data infrastructure that handles our raw video ingestion, cleansed video, and the context needed to correlate that data and feed it into the AI model. Because every design is unique and cannot simply be copied, this involved a good deal of experimentation and refinement. Drawing on that experience, we have compiled a set of valuable insights.
An Azure Data Lake Storage Gen2 account is built on top of Azure Blob Storage, which offers many storage benefits, including low cost and multiple access tiers. Gen2 combines the features of Azure Blob Storage and Azure Data Lake Storage Gen1, adding capabilities such as file system semantics, file-level security, and scalability. This dramatically improves analytics performance, because data does not need to be copied or transformed before analysis, unlike with the flat namespace of Blob storage. The hierarchical namespace of Gen2 also simplifies management, allowing files to be organized into directories and subdirectories, and it strengthens security by supporting POSIX permissions on individual directories and files. Ultimately, the hierarchical namespace is what sets Gen2 apart and makes it an ideal analytics foundation.
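To make the hierarchical namespace concrete, here is a minimal sketch using the azure-storage-file-datalake Python SDK; the account name, the "datalake" file system, and the paths are placeholders, not values from this article.

```python
# Illustrative sketch only; <storage-account>, the file system name, and paths are assumptions.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# With a hierarchical namespace, directories are real objects that can be
# created, secured, and renamed directly instead of being simulated by prefixes.
fs = service.get_file_system_client("datalake")
directory = fs.create_directory("raw/sales/2024/01/15")

# Renaming (or moving) a directory is a single metadata operation on Gen2,
# rather than a copy-and-delete of every blob under a prefix.
directory.rename_directory("datalake/raw/sales_emea/2024/01/15")
```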
Data reaches us from many sources, including logs, media files, and structured and unstructured data from relational databases. We ingest it with Azure Data Factory and land it in Azure Data Lake Storage Gen2, an excellent repository for our analytics solution and the hub from which everything else starts. Azure Databricks can handle preparation and training, and downstream services such as Azure Synapse Analytics, Azure Analysis Services, or Power BI can cover the model-and-serve stage, but all of these services rely on Azure Data Lake Storage Gen2. Data can be extracted from the lake, prepped and trained in Databricks, and transmitted to Synapse, or Synapse can pull data directly from the lake. Whatever services you add to your analytics solution, such as Cosmos DB or Stream Analytics, the data hub tying everything together is the Azure data lake.
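As a hedged illustration of the lake acting as that hub, the PySpark sketch below (for example, in an Azure Databricks notebook) reads raw data from one zone, applies a light transformation, and writes a curated copy back; the container, account, and column names are assumptions.

```python
# Illustrative PySpark sketch; paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw_path = "abfss://datalake@<storage-account>.dfs.core.windows.net/raw/telemetry/2024/01/"
curated_path = "abfss://datalake@<storage-account>.dfs.core.windows.net/curated/telemetry/"

raw_df = spark.read.json(raw_path)

# Basic cleansing before the data moves downstream to Synapse or Power BI.
curated_df = (
    raw_df
    .dropDuplicates(["event_id"])
    .withColumn("ingested_date", F.to_date("ingested_at"))
)

curated_df.write.mode("overwrite").partitionBy("ingested_date").parquet(curated_path)
```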
Zones
A “zone” is simply a folder; there are typically three to five, although the exact number and arrangement vary by implementation. Some zones are optional, but most implementations include a landing zone, where raw data is initially stored in its immutable state. Although this data is not always fit for consumption, it can be organized into separate folders by source. Because the landing zone holds complete data sets from every source, it can become expensive, so it is recommended to move the data to a cooler access tier periodically through a programmatic process.
Next is the staged zone, which marks the first step toward refining the data by giving it a basic structure. Raw data is typically processed automatically into an improved form to prepare it for curation, and even at this stage it provides greater value than the raw data. The curated zone follows, where data is transformed into consumable datasets such as tables or files. While this zone can feed a data warehouse, it cannot replace one: it is unsuitable for end-user dashboards or reports because queries against it are slower. Instead, it is ideal for internal analytics with no strict time requirements, such as ad hoc queries. It differs from the staged zone in that it contains only consumable data that has been quality checked and combined with like sources. The curated data is then used to feed the production zone.
The production zone provides easy access for data consumers and includes business logic, such as surrogate keys, for specific applications. An experimental zone can serve as a sandbox where data scientists combine and explore datasets from inside and outside the data lake. Organizations differ in which zones they use and what they call them, so it is more important to understand the philosophy behind each zone than its exact name. The most common zones are landing, curated, and production, which almost every analytics solution has; other zones add extra capabilities around these primary ones.
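The earlier point about programmatically demoting older landing-zone data to a cooler tier could look roughly like the sketch below, using the azure-storage-blob SDK; the account, container, "raw/" prefix, and 90-day cutoff are assumptions, and a Blob Storage lifecycle management policy can achieve the same result without any code.

```python
# Hedged sketch; account, container, prefix, and cutoff are placeholders.
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
container = service.get_container_client("datalake")
cutoff = datetime.now(timezone.utc) - timedelta(days=90)

# Walk the landing zone and demote anything older than the cutoff to the Cool tier.
for blob in container.list_blobs(name_starts_with="raw/"):
    if blob.last_modified < cutoff:
        container.get_blob_client(blob.name).set_standard_blob_tier("Cool")
```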
After establishing the zone structure, the next step is to develop a folder structure. Here are some strategies to consider:
- The naming convention should be human-readable and self-documenting. Keep it simple and easy to understand, and avoid over-complicating the folder structure, which makes it harder to scale and manage.
- Ensure effective permissions without creating unnecessary maintenance overhead. Avoid building too many sub-directories that require managing too many permissions.
- Partition thoughtfully. Align the partition strategy with the purpose of the zone you are working in; for instance, aim for optimal retrieval in the curated zone, and feel free to use a different strategy in each zone (see the sketch after this list).
- Design intentionally.
- Group similar items together. Folders should contain files of the same schema and format to make it easier to work with them consistently.
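A minimal sketch of a self-documenting, date-partitioned path convention is shown below; the zone, source, and dataset names are illustrative only.

```python
# Illustrative path-building helper; names are placeholders.
from datetime import date


def lake_path(zone: str, source: str, dataset: str, day: date) -> str:
    """Build a consistent path: <zone>/<source>/<dataset>/yyyy/mm/dd/."""
    return f"{zone}/{source}/{dataset}/{day:%Y/%m/%d}/"


print(lake_path("raw", "crm", "customers", date(2024, 1, 15)))
# raw/crm/customers/2024/01/15/
```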
Hierarchical namespace benefit
It’s essential to start envisioning how to create our Azure data lake, store our files, and set it up as the central hub of our analytical solution. The advantages of a hierarchical namespace and efficient management of the different zones and the files within them are immense. With Azure Data Lake Storage, we can manage access control and ACLs per folder and per file, which is a significant improvement over traditional blob storage. We can also configure networking, shared access signatures, and other security features, which we will discuss in detail in a future lesson.
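A hedged example of folder-level access control on a Gen2 account is sketched below using the azure-storage-file-datalake SDK; the account, file system, directory, and Azure AD object ID are placeholders.

```python
# Illustrative ACL sketch; <storage-account> and <object-id> are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("datalake")
directory = fs.get_directory_client("raw/crm")

# Grant a principal (by Azure AD object ID) read/execute on this folder, plus a
# default entry so new child items inherit the same ACL.
acl = (
    "user::rwx,group::r-x,other::---,"
    "user:<object-id>:r-x,default:user:<object-id>:r-x"
)
directory.set_access_control(acl=acl)
```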
Azure Data Lake Storage Gen2 is an excellent solution for analytical workloads or whenever a human-readable hierarchy is needed. It combines the benefits of Azure Blob Storage and ADLS Gen1 and is a fundamental component of almost every Azure analytics solution, so you will revisit it frequently no matter how you construct your solution. Finally, remember that security is crucial: enabling the firewall and restricting access to other Azure services is recommended. We will delve deeper into security later.
An example folder structure strategy
We can structure our data lake by source system, by department, by project, or by any other entity that makes sense for our business needs. For example, in the raw zone we can create a folder for each data source; write permissions can then be granted to each source system at the data-source level, with default ACLs inherited by any new folders created underneath. The production zone can instead be divided by department, such as sales, marketing, or internal IT, with different application logic associated with each folder. Dates work well as the final layer of the hierarchy, with year, month, and day folders, where the day folder holds the actual files. Setting permissions at the data-source level so they are inherited downward keeps maintenance simple. Combining these strategies, our folder structure might look like the sketch below.
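The following hypothetical sketch lays out such a zone/source/date structure; the sources, departments, and dates are illustrative only.

```python
# Illustrative layout sketch; account, file system, and folder names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("datalake")

# Raw zone: one folder per source system, then year/month/day partitions.
# Production zone: one folder per consuming department.
layout = [
    "raw/crm/2024/01/15",
    "raw/erp/2024/01/15",
    "raw/weblogs/2024/01/15",
    "production/sales",
    "production/marketing",
    "production/internal-it",
]

for path in layout:
    fs.create_directory(path)
```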
Additionally, sensitivity levels can be used to separate general data from sensitive data in different zones. The folder structure affects query performance, security, data pruning, and administrative maintenance, so thoughtful decisions up front pay off down the road. Remember that each business has unique analytical needs, and the solution must be tailored accordingly.