Data Lineage

Data lineage improves our ability to understand and debug our data logic. Beyond its value as documentation, there are further reasons why data lineage is central to how Aligned works.

Value gain

Feast, an open-source feature store, lacks data lineage features. Consequently, end-users have to keep track of the data lineage themselves, which is a common source of errors. A user may not understand the full lineage and its data dependencies, which causes problems when running feature transformations. For instance, if a feature assumes it has access to another feature that has not been loaded into memory, the result is run-time errors and crashes.

Aligned, in contrast, tracks these dependencies for us. Because each transformation returns a new type, and Python's dunder methods (such as __add__ or __sub__) can be given custom implementations, we can build a complex graph of dependencies that is largely transparent to the end-user. As a result, users can focus on business logic in an intuitive manner, while Aligned handles the compilation of dependency logic.
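The idea of recording dependencies through dunder methods can be illustrated with a minimal sketch. The Feature class, the source_columns helper and the column names below are hypothetical and written only for this illustration; they are not Aligned's actual classes or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical feature node, not Aligned's real implementation."""
    name: str
    depends_on: tuple = ()  # the Feature objects this one was derived from

    def _combine(self, other: "Feature", symbol: str) -> "Feature":
        # Every operation returns a new node that remembers its operands.
        return Feature(f"({self.name} {symbol} {other.name})", (self, other))

    def __add__(self, other: "Feature") -> "Feature":
        return self._combine(other, "+")

    def __sub__(self, other: "Feature") -> "Feature":
        return self._combine(other, "-")

    def __mul__(self, other: "Feature") -> "Feature":
        return self._combine(other, "*")


def source_columns(feature: Feature) -> set:
    """Walk the recorded graph down to the raw (non-derived) columns."""
    if not feature.depends_on:
        return {feature.name}
    columns = set()
    for parent in feature.depends_on:
        columns |= source_columns(parent)
    return columns


# The user writes ordinary arithmetic; the graph is built as a side effect.
pickup_lat = Feature("pickup_latitude")
dropoff_lat = Feature("dropoff_latitude")
lat_diff = dropoff_lat - pickup_lat
lat_diff_squared = lat_diff * lat_diff

print(sorted(source_columns(lat_diff_squared)))
```

The user never declares a dependency explicitly in this sketch. The operators record it, which is the property the paragraph above describes.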

We therefore get the same functionality as dbt's ref feature, but with a higher level of detail and flexibility. This is because we can get the lineage not only for individual features, but also across different sources of data and different types of storage. For instance, some features could be stored in S3, others in a data warehouse (DWH) such as Redshift, and others in a production database such as PostgreSQL.

However, this fine-grained data dependency graph brings even bigger advantages.

Feature Optimisation

A significant advantage of such a system is its potential to reduce the compute required. With the dependency graph, we load and compute only what a requested feature actually depends on.

For instance, if we request the feature day_of_week, we only need to load the picked_up_at feature. This avoids the need to load extensive coordinate data or to compute an unrelated Euclidean distance.
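As a rough sketch of how such a lookup could work, the snippet below walks a dependency graph from the requested feature down to the raw columns it needs. The DEPENDENCIES mapping, the coordinate column names and the required_columns function are assumptions made for this example; only day_of_week and picked_up_at come from the text above.

```python
# Hypothetical dependency graph: feature name -> features it is derived from.
# Raw columns have no dependencies.
DEPENDENCIES = {
    "picked_up_at": [],
    "pickup_latitude": [],
    "pickup_longitude": [],
    "dropoff_latitude": [],
    "dropoff_longitude": [],
    "day_of_week": ["picked_up_at"],
    "euclidean_distance": [
        "pickup_latitude",
        "pickup_longitude",
        "dropoff_latitude",
        "dropoff_longitude",
    ],
}


def required_columns(requested):
    """Return the raw columns needed to compute the requested features."""
    raw, seen = set(), set()
    stack = list(requested)
    while stack:
        feature = stack.pop()
        if feature in seen:
            continue
        seen.add(feature)
        parents = DEPENDENCIES[feature]
        if not parents:
            raw.add(feature)  # a raw column that must be loaded
        stack.extend(parents)
    return raw


print(required_columns(["day_of_week"]))
```

Requesting day_of_week resolves to picked_up_at alone, so none of the coordinate columns are loaded and euclidean_distance is never computed.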

Furthermore, Aligned can remove features from the computation graph at a system level. Because Aligned knows which features our models use, it can identify the features that no model depends on, which are effectively dead code.
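One way to picture this dead-code detection is as a reachability check: any feature that cannot be reached from some model's inputs is unused. The sketch below assumes a hypothetical graph, a hypothetical demand_model and hypothetical helper names. It illustrates the idea and is not Aligned's implementation.

```python
# Hypothetical dependency graph: feature name -> features it is derived from.
DEPENDENCIES = {
    "picked_up_at": [],
    "pickup_latitude": [],
    "dropoff_latitude": [],
    "day_of_week": ["picked_up_at"],
    "is_weekend": ["day_of_week"],
    "latitude_diff": ["pickup_latitude", "dropoff_latitude"],
}

# Hypothetical mapping of models to the features they consume.
MODEL_FEATURES = {"demand_model": ["is_weekend"]}


def live_features(dependencies, model_features):
    """Features reachable from at least one model's inputs."""
    live = set()
    stack = [name for names in model_features.values() for name in names]
    while stack:
        feature = stack.pop()
        if feature not in live:
            live.add(feature)
            stack.extend(dependencies[feature])
    return live


def dead_features(dependencies, model_features):
    """Features that no model depends on, directly or indirectly."""
    return set(dependencies) - live_features(dependencies, model_features)


print(sorted(dead_features(DEPENDENCIES, MODEL_FEATURES)))
```

Here latitude_diff and its two source columns are reported as dead, because demand_model only needs is_weekend, which in turn needs day_of_week and picked_up_at.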

This feature optimisation becomes particularly apparent in columnar databases, where we select only the columns that are actually needed. This leads to reduced network traffic and faster read times.
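To make the column selection concrete, here is a small sketch that turns a set of required columns into a projection query instead of a SELECT *. The build_select function and the trips table name are hypothetical, and the double-quote identifier quoting is a simplification that assumes a standard SQL dialect.

```python
from __future__ import annotations


def build_select(table: str, columns: set[str]) -> str:
    """Build a query that reads only the given columns from the table."""
    if not columns:
        raise ValueError("at least one column is required")
    # Sort so the generated query is deterministic.
    quoted = ", ".join(f'"{column}"' for column in sorted(columns))
    return f'SELECT {quoted} FROM "{table}"'


# Only the column needed for day_of_week is read from storage.
query = build_select("trips", {"picked_up_at"})
print(query)
```

In a columnar store, a query like this reads a single column from disk, whereas SELECT * would read every column in the table.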