Examples

Credit Scoring

See how you can use Aligned to load, process, train, and serve a credit scoring model.


For this project we will combine three different sources of features, or FeatureViews.

View the full code at GitHub.

Defining our features

Zipcode

We also have some location features based on a zipcode. The features are stored in a Parquet file with the following schema.

| Column name | Data type |
| --- | --- |
| zipcode | Int |
| city | String |
| state | String |
| location_type | String |
| tax_returns_filed | Int |
| population | Int |
| total_wages | Int |
| event_timestamp | Datetime |
| created_timestamp | Datetime |

from aligned import feature_view, String, Int64, EventTimestamp, Timestamp, FileSource
from datetime import timedelta

zipcode_source = FileSource.parquet_at("data/zipcode_table.parquet")

@feature_view(
    name="zipcode_features",
    description="Zipcode features for a given location",
    batch_source=zipcode_source,
)
class Zipcode:

    zipcode = Int64().as_entity()

    event_timestamp = EventTimestamp(ttl=timedelta(days=3650))
    created_timestamp = Timestamp()

    city = String()
    state = String()
    location_type = String()
    tax_returns_filed = Int64()
    population = Int64()
    total_wages = Int64()

    is_primary_location = location_type == "PRIMARY"
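
The last line of the view, `is_primary_location = location_type == "PRIMARY"`, declares a derived boolean feature. As a rough illustration of what that comparison evaluates to for each row, here is a plain-Python sketch. It does not use Aligned, `add_is_primary_location` is a hypothetical helper, and the sample rows are made up.

```python
# Hypothetical sample rows; the values are invented for illustration.
rows = [
    {"zipcode": 7675, "location_type": "PRIMARY"},
    {"zipcode": 7677, "location_type": "ACCEPTABLE"},
]


def add_is_primary_location(row: dict) -> dict:
    """Return a copy of the row with the derived boolean feature added."""
    return {**row, "is_primary_location": row["location_type"] == "PRIMARY"}


derived = [add_is_primary_location(row) for row in rows]
print([row["is_primary_location"] for row in derived])  # [True, False]
```

The source rows are left untouched; the derived value only exists in the transformed copy.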

Credit History

We also have some features describing a person's credit history. They are stored in a separate Parquet file with the following schema.

| Column name | Data type |
| --- | --- |
| dob_ssn | String |
| credit_card_due | Int |
| mortgage_due | Int |
| student_loan_due | Int |
| vehicle_loan_due | Int |
| hard_pulls | Int |
| missed_payments_2y | Int |
| missed_payments_1y | Int |
| missed_payments_6m | Int |
| event_timestamp | Datetime |
| created_timestamp | Datetime |

from aligned import feature_view, String, EventTimestamp, Int64, FileSource
from datetime import timedelta

credit_history_source = FileSource.parquet_at("data/credit_history.parquet")

@feature_view(
    name="credit_history",
    description="The credit history for a given person",
    batch_source=credit_history_source
)
class CreditHistory:

    dob_ssn = String().as_entity().description(
        "Date of birth and last four digits of social security number"
    )

    event_timestamp = EventTimestamp(ttl=timedelta(days=90))

    credit_card_due = Int64()
    mortgage_due = Int64()
    student_loan_due = Int64()
    vehicle_loan_due = Int64()
    hard_pulls = Int64()
    missed_payments_2y = Int64()
    missed_payments_1y = Int64()
    missed_payments_6m = Int64()
    bankruptcies = Int64()

Loan

Let's look at the value we want to predict: whether or not a person was granted a loan.

This is a boolean value stored in the loan_status column of a Parquet file. The same file also contains several other features.

| Column name | Data type |
| --- | --- |
| loan_status | Bool |
| loan_id | Int |
| dob_ssn | String |
| zipcode | Int |
| person_age | Int |
| person_income | Int |
| person_home_ownership | String |
| person_emp_length | Float |
| loan_intent | String |
| loan_amnt | Int |
| event_timestamp | Datetime |

Now let's describe this data using Aligned.

from aligned import feature_view, Int64, String, FileSource, EventTimestamp, Bool, Float

loan_source = FileSource.parquet_at("data/loan_table.parquet", mapping_keys={
    "loan_amnt": "loan_amount"
})

ownership_values = ['RENT', 'OWN', 'MORTGAGE', 'OTHER']
loan_intent_values = [
    "PERSONAL", "EDUCATION", 'MEDICAL', 'VENTURE', 'HOMEIMPROVEMENT', 'DEBTCONSOLIDATION'
]

@feature_view(
    name="loan",
    description="The granted loans",
    batch_source=loan_source
)
class Loan:

    loan_id = String().as_entity()

    event_timestamp = EventTimestamp()

    loan_status = Bool().description("If the loan was granted or not")

    person_age = Int64()
    person_income = Int64()

    person_home_ownership = String().accepted_values(ownership_values)
    person_home_ownership_ordinal = person_home_ownership.ordinal_categories(ownership_values)

    person_emp_length = Float().description(
        "The number of months the person has been employed in the current job"
    )

    loan_intent = String().accepted_values(loan_intent_values)
    loan_intent_ordinal = loan_intent.ordinal_categories(loan_intent_values)

    loan_amount = Int64()
    loan_int_rate = Float().description("The interest rate of the loan")
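
Two things in this view go beyond a plain schema description: `mapping_keys` renames the `loan_amnt` column to `loan_amount`, and `ordinal_categories` turns a string category into a number. The sketch below mimics both steps in plain Python. It assumes that the ordinal value is the category's position in the list passed to `ordinal_categories`, which the text above does not confirm. `rename_columns` and `ordinal_encode` are hypothetical helpers, not Aligned functions, and the sample row is made up.

```python
ownership_values = ["RENT", "OWN", "MORTGAGE", "OTHER"]
mapping_keys = {"loan_amnt": "loan_amount"}


def rename_columns(row: dict, mapping: dict) -> dict:
    """Rename source columns to feature names; other columns pass through."""
    return {mapping.get(key, key): value for key, value in row.items()}


def ordinal_encode(value: str, categories: list) -> int:
    """Assumed encoding: the index of the value in the category list."""
    return categories.index(value)


# Made-up raw row, as it might appear in the source file
raw_row = {"loan_amnt": 35000, "person_home_ownership": "MORTGAGE"}

row = rename_columns(raw_row, mapping_keys)
row["person_home_ownership_ordinal"] = ordinal_encode(
    row["person_home_ownership"], ownership_values
)
print(row)
```

Note that `list.index` raises a `ValueError` for a value outside the list, which loosely mirrors the intent of `accepted_values` in the view.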

Defining the Model

Now that we have defined where our features are stored and how they should be processed, we can define which features our model will use.

First we need to import the feature views that we want to use. We then define that we want the loan_status to be the label for our credit_scoring model, and that it is a classification task.

from aligned import model_contract
from examples.credit_scoring.credit_history import CreditHistory
from examples.credit_scoring.zipcode import Zipcode
from examples.credit_scoring.loan import Loan

credit = CreditHistory()
zipcode = Zipcode()
loan = Loan()

@model_contract(
    name="credit_scoring",
    description="A model that does credit scoring",
    features=[
        credit.credit_card_due,
        credit.mortgage_due,
        credit.student_loan_due,
        credit.vehicle_loan_due,
        credit.hard_pulls,
        credit.missed_payments_1y,
        credit.missed_payments_2y,
        credit.missed_payments_6m,
        credit.bankruptcies,

        zipcode.city,
        zipcode.state,
        zipcode.is_primary_location,
        zipcode.tax_returns_filed,
        zipcode.total_wages,

        loan.person_age,
        loan.person_income,
        loan.person_emp_length,
        loan.person_home_ownership_ordinal,
        loan.loan_amount,
        loan.loan_int_rate,
        loan.loan_intent_ordinal
    ]
)
class CreditScoring:

    was_granted_loan = loan.loan_status.as_classification_target()
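
The contract pulls features from three views that are keyed on different entities (`loan_id`, `dob_ssn`, and `zipcode`). Conceptually, each model input row is assembled by looking up every view with its own entity key. The following is a simplified plain-Python sketch of that idea, not Aligned's implementation. The lookup tables and their values are invented, `features_for` here is a hypothetical stand-in, and event timestamps are ignored entirely.

```python
# Made-up feature tables, each keyed by the entity of its view.
loan_view = {"10000": {"person_age": 22, "loan_amount": 35000}}
credit_history_view = {"19630621_4278": {"hard_pulls": 1, "bankruptcies": 0}}
zipcode_view = {7675: {"state": "NJ", "total_wages": 1000000}}


def features_for(entity_row: dict) -> dict:
    """Assemble one model input row by looking up each view by its entity key."""
    features = {}
    features.update(loan_view[entity_row["loan_id"]])
    features.update(credit_history_view[entity_row["dob_ssn"]])
    features.update(zipcode_view[entity_row["zipcode"]])
    return features


request = {"loan_id": "10000", "dob_ssn": "19630621_4278", "zipcode": 7675}
model_input = features_for(request)
print(model_input)
```

This is why a request for the `credit_scoring` model needs all three entity values, even though the model itself only sees the resulting feature columns.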

Training a model

To train a model, we can load a training data set with just a few lines.

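from aligned import FileSource

# `await` assumes an async context, e.g. a notebook or an async function.
# Load the feature store defined in features.json
store = await FileSource.json_at("features.json").feature_store()

# The entities to load training data for
entities = FileSource.parquet_at("training_entities.parquet")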
training_data = await store.model("credit_scoring")\
    .with_targets()\
    .features_for(entities)\
    .to_pandas()

Now that we have some data, we can train a model. There is no need to specify which features to use yourself; this is handled for you.

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()

classifier.fit(training_data.input, training_data.target)

Serving a model

We can do something similar to serve our model. However, rather than using our batch source, we can switch to an online store.

This can be done with the following code.

from aligned import RedisConfig

online_store = store.with_source(
    RedisConfig(env_key="REDIS_URL")
)

entities = {
    "zipcode": [...], 
    "dob_ssn": [...], 
    "loan_id": [...]
}

online_job = online_store.model("credit_scoring")\
    .features_for(entities)

feature_columns = online_job.request_result.feature_columns
features = await online_job.to_pandas()

y = classifier.predict(
    features[feature_columns]
)
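
The `entities` dictionary above appears to be column-oriented: each key is an entity name, and each list holds one value per row to predict for. A small check like the hypothetical `validate_entities` below (not part of Aligned) could catch a missing entity or mismatched list lengths before the request is sent. The required entity names are taken from the three views defined earlier; the sample values are invented.

```python
REQUIRED_ENTITIES = {"zipcode", "dob_ssn", "loan_id"}


def validate_entities(entities: dict) -> int:
    """Check that every entity column is present and all have equal length.

    Returns the number of rows in the request.
    """
    missing = REQUIRED_ENTITIES - set(entities)
    if missing:
        raise ValueError(f"Missing entity columns: {sorted(missing)}")
    lengths = {len(values) for values in entities.values()}
    if len(lengths) != 1:
        raise ValueError("All entity columns must have the same number of rows")
    return lengths.pop()


# Invented sample values
sample_entities = {
    "zipcode": [7675, 94109],
    "dob_ssn": ["19630621_4278", "19790429_9552"],
    "loan_id": ["10000", "10001"],
}
print(validate_entities(sample_entities))  # 2
```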