Doing multiple things at once and time deltas#

In the first tutorial, we applied a single aggregation function with a single lookperiod to a single value at a time. In practice, you'll likely want to create features for the same value at multiple lookbehind intervals and with multiple aggregation functions. Additionally, some common features, such as the current age, require slightly different handling than what we've covered so far.

Luckily, timeseriesflattener makes all of this extremely simple.

This tutorial covers:

  • Multiple aggregations, lookdistances, and features at once.

  • Creating features based on a timedelta.

Multiple aggregation functions and lookperiods#

To apply multiple aggregation functions across all combinations of lookperiods, you simply supply more aggregators and lookperiods in the lists! Let's see an example. We start by loading the prediction times and a dataframe containing the values we wish to aggregate.

Note: The interface for creating multiple features at once works in exactly the same way for OutcomeSpec as for PredictorSpec. We will use PredictorSpec for illustration; a sketch of an equivalent OutcomeSpec appears at the end of this section.

# Load the prediction times
from __future__ import annotations

from timeseriesflattener.testing.load_synth_data import load_synth_prediction_times

df_prediction_times = load_synth_prediction_times()
# load values for a temporal predictor
from timeseriesflattener.testing.load_synth_data import load_synth_predictor_float

df_synth_predictors = load_synth_predictor_float()
df_synth_predictors.head()
shape: (5, 3)
┌───────────┬─────────────────────┬──────────┐
│ entity_id ┆ timestamp           ┆ value    │
│ i64       ┆ datetime[μs]        ┆ f64      │
╞═══════════╪═════════════════════╪══════════╡
│ 9476      ┆ 1969-03-05 08:08:00 ┆ 0.816995 │
│ 4631      ┆ 1967-04-10 22:48:00 ┆ 4.818074 │
│ 3890      ┆ 1969-12-15 14:07:00 ┆ 2.503789 │
│ 1098      ┆ 1965-11-19 03:53:00 ┆ 3.515041 │
│ 1626      ┆ 1966-05-03 14:07:00 ┆ 4.353115 │
└───────────┴─────────────────────┴──────────┘

Note that df_synth_predictors is not sorted by entity_id, and that each entity can have multiple values.
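A quick, purely illustrative check (not part of the tutorial's original code) confirms this by comparing the number of rows to the number of unique ids:

# More rows than unique ids means multiple values per entity.
n_rows = df_synth_predictors.height
n_ids = df_synth_predictors["entity_id"].n_unique()
print(f"{n_rows} rows across {n_ids} unique entities")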

Time to make a spec.

import datetime as dt

import numpy as np
from timeseriesflattener import PredictorSpec, ValueFrame
from timeseriesflattener.aggregators import LatestAggregator, MeanAggregator


# Helper function to create a (start, end) tuple of timedeltas
def make_timedelta_interval(start_days: int, end_days: int) -> tuple[dt.timedelta, dt.timedelta]:
    return (dt.timedelta(days=start_days), dt.timedelta(days=end_days))


predictor_spec = PredictorSpec(
    value_frame=ValueFrame(
        init_df=df_synth_predictors,
        entity_id_col_name="entity_id",
        value_timestamp_col_name="timestamp",
    ),
    aggregators=[MeanAggregator(), LatestAggregator(timestamp_col_name="timestamp")],
    lookbehind_distances=[
        make_timedelta_interval(0, 30),
        make_timedelta_interval(30, 365),
        make_timedelta_interval(365, 730),
    ],
    fallback=np.nan,
)

Let’s break it down. We supply two aggregators, MeanAggregator and LatestAggregator, which will be applied across all three lookbehind intervals we specified: 0-30 days, 30-365 days, and 365-730 days. We therefore expect n_aggregators * n_lookbehind_distances = 2 * 3 = 6 features from this single column. Let’s flatten the data and see!

from timeseriesflattener import Flattener, PredictionTimeFrame

flattener = Flattener(
    predictiontime_frame=PredictionTimeFrame(
        init_df=df_prediction_times, entity_id_col_name="entity_id", timestamp_col_name="timestamp"
    )
)

df = flattener.aggregate_timeseries(specs=[predictor_spec]).df
Processing spec: ['value']


df.head()
shape: (5, 9)
Columns: entity_id (i64), timestamp (datetime[μs]), prediction_time_uuid (str), plus six feature columns (all f64), one per aggregator × lookbehind combination:

  pred_value_within_0_to_30_days_mean_fallback_nan
  pred_value_within_0_to_30_days_latest_fallback_nan
  pred_value_within_30_to_365_days_mean_fallback_nan
  pred_value_within_30_to_365_days_latest_fallback_nan
  pred_value_within_365_to_730_days_mean_fallback_nan
  pred_value_within_365_to_730_days_latest_fallback_nan

┌───────────┬─────────────────────┬──────────────────────────────────┬─────────────────────┐
│ entity_id ┆ timestamp           ┆ prediction_time_uuid             ┆ six feature columns │
╞═══════════╪═════════════════════╪══════════════════════════════════╪═════════════════════╡
│ 9852      ┆ 1965-01-02 09:35:00 ┆ "9852-1965-01-02 09:35:00.00000… ┆ NaN                 │
│ 1467      ┆ 1965-01-02 10:05:00 ┆ "1467-1965-01-02 10:05:00.00000… ┆ NaN                 │
│ 1125      ┆ 1965-01-02 12:55:00 ┆ "1125-1965-01-02 12:55:00.00000… ┆ NaN                 │
│ 649       ┆ 1965-01-02 14:01:00 ┆ "649-1965-01-02 14:01:00.000000" ┆ NaN                 │
│ 2070      ┆ 1965-01-03 08:01:00 ┆ "2070-1965-01-03 08:01:00.00000… ┆ NaN                 │
└───────────┴─────────────────────┴──────────────────────────────────┴─────────────────────┘

All six feature columns are NaN for these rows: they are the earliest prediction times in the dataset, so no observations fall within their lookbehind windows and the fallback applies.
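As a quick sanity check (our own addition, relying only on the pred_ prefix visible in the output), we can count the generated feature columns:

# 2 aggregators * 3 lookbehind intervals = 6 feature columns.
feature_cols = [col for col in df.columns if col.startswith("pred_")]
assert len(feature_cols) == 6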

And that’s what we get!
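As noted at the top of this tutorial, the same pattern applies to outcomes. Below is a minimal sketch, assuming OutcomeSpec accepts lookahead_distances analogously to PredictorSpec's lookbehind_distances; treat it as illustrative rather than a definitive reference:

from timeseriesflattener import OutcomeSpec

# Same cartesian expansion, but looking forward from the prediction time.
outcome_spec = OutcomeSpec(
    value_frame=ValueFrame(
        init_df=df_synth_predictors,
        entity_id_col_name="entity_id",
        value_timestamp_col_name="timestamp",
    ),
    aggregators=[MeanAggregator(), LatestAggregator(timestamp_col_name="timestamp")],
    lookahead_distances=[make_timedelta_interval(0, 30), make_timedelta_interval(30, 365)],
    fallback=np.nan,
)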

Multiple values from the same dataframe#

Sometimes you might have several values measured at the same time that you want to aggregate in the same manner. In timeseriesflattener, this is handled by simply including multiple value columns in the dataframe you pass to ValueFrame. Let's see an example.

import polars as pl

# add a new column to df_synth_predictors to simulate a new predictor measured at the same time
df_synth_predictors = df_synth_predictors.with_columns(
    new_predictor=pl.Series(np.random.rand(df_synth_predictors.shape[0]))
)
df_synth_predictors.head()
shape: (5, 4)
┌───────────┬─────────────────────┬──────────┬───────────────┐
│ entity_id ┆ timestamp           ┆ value    ┆ new_predictor │
│ i64       ┆ datetime[μs]        ┆ f64      ┆ f64           │
╞═══════════╪═════════════════════╪══════════╪═══════════════╡
│ 9476      ┆ 1969-03-05 08:08:00 ┆ 0.816995 ┆ 0.38953       │
│ 4631      ┆ 1967-04-10 22:48:00 ┆ 4.818074 ┆ 0.92491       │
│ 3890      ┆ 1969-12-15 14:07:00 ┆ 2.503789 ┆ 0.362241      │
│ 1098      ┆ 1965-11-19 03:53:00 ┆ 3.515041 ┆ 0.922223      │
│ 1626      ┆ 1966-05-03 14:07:00 ┆ 4.353115 ┆ 0.331179      │
└───────────┴─────────────────────┴──────────┴───────────────┘

We make a PredictorSpec similar to the one above. Let's try some new aggregators.

from timeseriesflattener.aggregators import MinAggregator, SlopeAggregator

# create a new predictor spec
predictor_spec = PredictorSpec(
    value_frame=ValueFrame(
        init_df=df_synth_predictors,
        entity_id_col_name="entity_id",
        value_timestamp_col_name="timestamp",
    ),
    aggregators=[MinAggregator(), SlopeAggregator(timestamp_col_name="timestamp")],
    lookbehind_distances=[
        make_timedelta_interval(0, 30),
        make_timedelta_interval(30, 365),
        make_timedelta_interval(365, 730),
    ],
    fallback=np.nan,
)

Now, all aggregators will be applied to each predictor column for each lookbehind distance. We therefore expect this spec to produce n_predictors * n_aggregators * n_lookbehind_distances = 2 * 2 * 3 = 12 features; see the sanity check after the output below.

df = flattener.aggregate_timeseries(specs=[predictor_spec]).df
Processing spec: ['value', 'new_predictor']


df.head()


shape: (5, 15)
Columns: entity_id (i64), timestamp (datetime[μs]), prediction_time_uuid (str), plus twelve feature columns (all f64), one per predictor × aggregator × lookbehind combination:

  pred_value_within_0_to_30_days_min_fallback_nan
  pred_new_predictor_within_0_to_30_days_min_fallback_nan
  pred_value_within_0_to_30_days_slope_fallback_nan
  pred_new_predictor_within_0_to_30_days_slope_fallback_nan
  pred_value_within_30_to_365_days_min_fallback_nan
  pred_new_predictor_within_30_to_365_days_min_fallback_nan
  pred_value_within_30_to_365_days_slope_fallback_nan
  pred_new_predictor_within_30_to_365_days_slope_fallback_nan
  pred_value_within_365_to_730_days_min_fallback_nan
  pred_new_predictor_within_365_to_730_days_min_fallback_nan
  pred_value_within_365_to_730_days_slope_fallback_nan
  pred_new_predictor_within_365_to_730_days_slope_fallback_nan

As above, the first five rows (entity_ids 9852, 1467, 1125, 649, and 2070) are the earliest prediction times, so all twelve feature columns are NaN.
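To double-check the cartesian expansion, we can reconstruct the 12 expected column names from the naming pattern visible above (the variable names here are our own, purely illustrative):

from itertools import product

# Pattern: pred_{column}_within_{start}_to_{end}_days_{aggregator}_fallback_nan
predictor_cols = ["value", "new_predictor"]
intervals = [(0, 30), (30, 365), (365, 730)]
aggregations = ["min", "slope"]

expected = {
    f"pred_{col}_within_{start}_to_{end}_days_{agg}_fallback_nan"
    for col, (start, end), agg in product(predictor_cols, intervals, aggregations)
}
assert expected == {col for col in df.columns if col.startswith("pred_")}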

TimeDelta features#

A commonly used feature that requires slightly different handling than what we've seen so far is age. Calculating the age at the prediction time requires a time delta between the prediction time and the entity's date of birth. To do this, we can use the TimeDeltaSpec. First, we load a dataframe containing the date of birth for each entity in our dataset.

from timeseriesflattener.testing.load_synth_data import load_synth_birthdays

df_birthdays = load_synth_birthdays()
df_birthdays.head()
shape: (5, 2)
┌───────────┬─────────────────────┐
│ entity_id ┆ birthday            │
│ i64       ┆ datetime[μs]        │
╞═══════════╪═════════════════════╡
│ 9045      ┆ 1932-10-24 03:16:00 │
│ 5532      ┆ 1920-12-09 09:41:00 │
│ 2242      ┆ 1917-03-20 17:00:00 │
│ 789       ┆ 1930-02-15 06:51:00 │
│ 9715      ┆ 1926-08-18 08:35:00 │
└───────────┴─────────────────────┘

df_birthdays is a dataframe containing a single value for each entity_id: their date of birth. Time to make a spec.

from timeseriesflattener import TimeDeltaSpec, TimestampValueFrame

age_spec = TimeDeltaSpec(
    init_frame=TimestampValueFrame(
        init_df=df_birthdays, entity_id_col_name="entity_id", value_timestamp_col_name="birthday"
    ),
    fallback=np.nan,
    output_name="age",
    time_format="years",  # can be ["seconds", "minutes", "hours", "days", "years"]
)

To make the TimeDeltaSpec, we define a TimestampValueFrame, specifying the column containing the entity id and the column containing the timestamps. fallback sets the value for entities without an entry in the TimestampValueFrame, output_name determines the name of the output column, and time_format specifies the unit of the output. Time to make features!

df = flattener.aggregate_timeseries(specs=[age_spec]).df
Processing spec: ['age']


df.head()


shape: (5, 4)
┌───────────┬─────────────────────┬──────────────────────────────────┬─────────────────────────────┐
│ entity_id ┆ timestamp           ┆ prediction_time_uuid             ┆ pred_age_years_fallback_nan │
│ i64       ┆ datetime[μs]        ┆ str                              ┆ f64                         │
╞═══════════╪═════════════════════╪══════════════════════════════════╪═════════════════════════════╡
│ 9852      ┆ 1965-01-02 09:35:00 ┆ "9852-1965-01-02 09:35:00.00000… ┆ 31.909651                   │
│ 1467      ┆ 1965-01-02 10:05:00 ┆ "1467-1965-01-02 10:05:00.00000… ┆ 22.9295                     │
│ 1125      ┆ 1965-01-02 12:55:00 ┆ "1125-1965-01-02 12:55:00.00000… ┆ 35.570157                   │
│ 649       ┆ 1965-01-02 14:01:00 ┆ "649-1965-01-02 14:01:00.000000" ┆ 27.093771                   │
│ 2070      ┆ 1965-01-03 08:01:00 ┆ "2070-1965-01-03 08:01:00.00000… ┆ 49.886379                   │
└───────────┴─────────────────────┴──────────────────────────────────┴─────────────────────────────┘
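Since fallback=np.nan, any prediction time whose entity has no entry in df_birthdays gets NaN in the age column. An illustrative way to count such rows (whether any exist in the synthetic data is not guaranteed):

# Prediction times with no recorded birthday fall back to NaN.
n_missing = df.filter(pl.col("pred_age_years_fallback_nan").is_nan()).height
print(f"{n_missing} prediction times have no recorded birthday")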

Let’s look at the values for a single entity to confirm that the age changes with the timestamp of the prediction time.

import polars as pl

df.filter(pl.col("entity_id") == 9903)
shape: (2, 4)
┌───────────┬─────────────────────┬──────────────────────────────────┬─────────────────────────────┐
│ entity_id ┆ timestamp           ┆ prediction_time_uuid             ┆ pred_age_years_fallback_nan │
│ i64       ┆ datetime[μs]        ┆ str                              ┆ f64                         │
╞═══════════╪═════════════════════╪══════════════════════════════════╪═════════════════════════════╡
│ 9903      ┆ 1965-11-14 00:33:00 ┆ "9903-1965-11-14 00:33:00.00000… ┆ 36.670773                   │
│ 9903      ┆ 1968-05-09 21:24:00 ┆ "9903-1968-05-09 21:24:00.00000… ┆ 39.154004                   │
└───────────┴─────────────────────┴──────────────────────────────────┴─────────────────────────────┘
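As a final cross-check, we can recompute the ages by hand. This sketch assumes the "years" format is the elapsed time in days divided by 365.25, which matches the values above; it is illustrative, not the library's documented definition:

# Join the birthdays back on and recompute the age at each prediction time.
manual = (
    df.filter(pl.col("entity_id") == 9903)
    .join(df_birthdays, on="entity_id")
    .with_columns(
        manual_age_years=(pl.col("timestamp") - pl.col("birthday")).dt.total_minutes()
        / (60 * 24 * 365.25)
    )
    .select("timestamp", "pred_age_years_fallback_nan", "manual_age_years")
)
print(manual)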