Introductory Tutorial#

TimeseriesFlattener flattens timeseries. This is especially helpful if you have complicated and irregular time series but want to train simple models.

We explain terminology as needed in this tutorial. If you need a reference, see the docs.

Applying it consists of 3 steps:

  1. Loading data (prediction times, predictor(s), and outcome(s))

  2. Specifying how to flatten the data and

  3. Flattening

The simplest case is adding one predictor and one outcome.

First, we’ll load the timestamps for every time we want to issue a prediction:

Loading data#

Loading prediction times#

Prediction times consist of two elements:

  1. The entity id. This is the entity about which the prediction is issued. In medical contexts, this is frequently a patient.

  2. The timestamp at which the prediction is to be issued.

from __future__ import annotations

from skimpy import skim
from timeseriesflattener.testing.load_synth_data import load_synth_prediction_times

df_prediction_times = load_synth_prediction_times()

skim(df_prediction_times)
df_prediction_times.sort(["entity_id"])
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types                                                                 │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                          │
│ ┃ dataframe          Values ┃ ┃ Column Type  Count ┃                                                          │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                          │
│ │ Number of rows    │ 10000  │ │ int64       │ 1     │                                                          │
│ │ Number of columns │ 2      │ │ datetime64  │ 1     │                                                          │
│ └───────────────────┴────────┘ └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┓  │
│ ┃ column_name       NA    NA %     mean     sd      p0    p25     p50     p75     p100    hist     ┃  │
│ ┡━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━┩  │
│ │ entity_id          0      0   4959  2886   0  2485  4922  7443  9999 ▇▇▇▇▇▇  │  │
│ └──────────────────┴──────┴─────────┴─────────┴────────┴──────┴────────┴────────┴────────┴────────┴──────────┘  │
│                                                    datetime                                                     │
│ ┏━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓  │
│ ┃ column_name       NA    NA %     first                       last                        frequency    ┃  │
│ ┡━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩  │
│ │ timestamp          0      0   1965-01-02 09:35:00       1969-12-31 21:42:00    None         │  │
│ └──────────────────┴──────┴─────────┴────────────────────────────┴────────────────────────────┴──────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯
shape: (10_000, 2)
entity_id  timestamp
i64        datetime[μs]
0          1969-01-11 09:55:00
1          1965-03-15 07:16:00
2          1969-09-13 23:18:00
3          1968-02-04 16:16:00
4          1965-01-28 12:33:00
…          …
9996       1965-01-30 17:19:00
9996       1965-07-18 17:12:00
9997       1967-06-08 07:52:00
9999       1965-07-19 14:59:00
9999       1968-02-07 22:24:00

Here, “entity_id” represents a patient ID and “timestamp” refers to the time when we want to issue a prediction. Note that each ID can have multiple prediction times.

Loading a temporal predictor#

Then, we’ll load the values for our temporal predictor. Temporal predictors are predictors that can have a different value at different timepoints.

from timeseriesflattener.testing.load_synth_data import load_synth_predictor_float

df_synth_predictors = load_synth_predictor_float()

skim(df_synth_predictors)
df_synth_predictors.sort(["entity_id"])
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types                                                                 │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                          │
│ ┃ dataframe          Values ┃ ┃ Column Type  Count ┃                                                          │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                          │
│ │ Number of rows    │ 100000 │ │ int64       │ 1     │                                                          │
│ │ Number of columns │ 3      │ │ datetime64  │ 1     │                                                          │
│ └───────────────────┴────────┘ │ float64     │ 1     │                                                          │
│                                └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━┳━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━┓  │
│ ┃ column_name     NA   NA %   mean     sd       p0           p25     p50     p75     p100   hist   ┃  │
│ ┡━━━━━━━━━━━━━━━━╇━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━┩  │
│ │ entity_id       0    0   4994   2887          0  2486  4996  7487 9999▇▇▇▇▇▇ │  │
│ │ value           0    0  4.983  2.885  0.0001514 2.483 4.975 7.486   10▇▇▇▇▇▇ │  │
│ └────────────────┴─────┴───────┴─────────┴─────────┴─────────────┴────────┴────────┴────────┴───────┴────────┘  │
│                                                    datetime                                                     │
│ ┏━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓  │
│ ┃ column_name       NA    NA %     first                       last                        frequency    ┃  │
│ ┡━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩  │
│ │ timestamp          0      0   1965-01-02 00:01:00       1969-12-31 23:37:00    None         │  │
│ └──────────────────┴──────┴─────────┴────────────────────────────┴────────────────────────────┴──────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯
shape: (100_000, 3)
entity_id  timestamp            value
i64        datetime[μs]         f64
0          1967-06-12 14:06:00  0.174793
0          1968-04-15 01:45:00  3.072293
0          1968-12-09 05:42:00  1.315754
0          1969-06-20 18:07:00  2.812481
0          1967-11-26 01:59:00  2.981185
…          …                    …
9999       1968-08-19 10:15:00  0.671907
9999       1966-01-03 22:34:00  4.158796
9999       1966-06-27 10:55:00  4.414455
9999       1968-04-02 12:58:00  1.552491
9999       1969-06-24 07:19:00  4.501553

Once again, note that there can be multiple values for each ID.

Loading a static predictor#

Frequently, you’ll have one or more static predictors describing each entity. In this case, an entity is a patient, and an example of a static predictor could be their sex. It doesn’t change over time (it’s static), but can be used as a predictor for each prediction time. Let’s load it in!

from timeseriesflattener.testing.load_synth_data import load_synth_sex

df_synth_sex = load_synth_sex()

skim(df_synth_sex)
df_synth_sex
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types                                                                 │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                          │
│ ┃ dataframe          Values ┃ ┃ Column Type  Count ┃                                                          │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                          │
│ │ Number of rows    │ 9999   │ │ int64       │ 2     │                                                          │
│ │ Number of columns │ 2      │ └─────────────┴───────┘                                                          │
│ └───────────────────┴────────┘                                                                                  │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓  │
│ ┃ column_name       NA    NA %    mean       sd      p0    p25     p50     p75     p100    hist    ┃  │
│ ┡━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩  │
│ │ entity_id          0     0     4999  2887   0  2500  4999  7500  9999▇▇▇▇▇▇  │  │
│ │ female             0     0   0.4984   0.5   0     0     0     1     1▇    ▇  │  │
│ └──────────────────┴──────┴────────┴───────────┴────────┴──────┴────────┴────────┴────────┴────────┴─────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯
shape: (9_999, 2)
entity_id  female
i64        i64
0          0
1          1
2          1
3          1
4          0
…          …
9995       0
9996       0
9997       1
9998       1
9999       0

As the predictor is static, there should only be a single value for each ID in this dataframe.
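
A quick sanity check for this assumption can be sketched in plain Python. The rows below are made-up stand-ins for (entity_id, female) pairs, and `find_duplicate_ids` is a hypothetical helper for illustration, not part of timeseriesflattener.

```python
from collections import Counter

# Hypothetical (entity_id, female) rows standing in for the static dataframe.
static_rows = [(0, 0), (1, 1), (2, 1), (3, 1), (4, 0)]


def find_duplicate_ids(rows):
    """Return the entity ids that occur more than once, sorted."""
    counts = Counter(entity_id for entity_id, _ in rows)
    return sorted(entity_id for entity_id, n in counts.items() if n > 1)


duplicates = find_duplicate_ids(static_rows)
print(duplicates)  # an empty list means one value per entity
```

If the list is non-empty, the data is not truly static for those entities and probably needs cleaning before it is used as a static predictor.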

Loading a temporal outcome#

And, lastly, our outcome values. We’ve chosen a binary outcome and only stored rows for the patients who experience the outcome, at the timestamp where it occurs. Patients without a row in this dataframe did not experience the outcome. We handle this by setting a fallback of 0; more on that in the following section.

from timeseriesflattener.testing.load_synth_data import load_synth_outcome

df_synth_outcome = load_synth_outcome()

skim(df_synth_outcome)
df_synth_outcome
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types                                                                 │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                          │
│ ┃ dataframe          Values ┃ ┃ Column Type  Count ┃                                                          │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                          │
│ │ Number of rows    │ 3103   │ │ int64       │ 2     │                                                          │
│ │ Number of columns │ 3      │ │ datetime64  │ 1     │                                                          │
│ └───────────────────┴────────┘ └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┓  │
│ ┃ column_name       NA    NA %     mean     sd      p0    p25     p50     p75     p100    hist     ┃  │
│ ┡━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━┩  │
│ │ entity_id          0      0   5032  2900   4  2499  5109  7555  9992 ▇▇▇▇▇▇  │  │
│ │ value              0      0      1     0   1     1     1     1     1 │  │
│ └──────────────────┴──────┴─────────┴─────────┴────────┴──────┴────────┴────────┴────────┴────────┴──────────┘  │
│                                                    datetime                                                     │
│ ┏━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓  │
│ ┃ column_name       NA    NA %     first                       last                        frequency    ┃  │
│ ┡━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩  │
│ │ timestamp          0      0   1965-01-04 07:50:00       1969-12-31 10:15:00    None         │  │
│ └──────────────────┴──────┴─────────┴────────────────────────────┴────────────────────────────┴──────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯
shape: (3_103, 3)
entity_id  timestamp            value
i64        datetime[μs]         i64
1325       1969-03-01 12:24:00  1
426        1966-12-02 21:38:00  1
9221       1967-09-26 01:01:00  1
4758       1968-10-20 01:03:00  1
804        1967-03-22 20:51:00  1
…          …                    …
3048       1968-08-19 15:49:00  1
7675       1965-12-21 04:21:00  1
5284       1965-03-23 00:49:00  1
2161       1969-02-23 22:09:00  1
8324       1966-07-23 14:47:00  1

This dataframe should contain at most 1 row per ID, which is the first time they experience the outcome.
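
If your raw data holds several outcome events per entity, reducing it to the first event might look like the following plain-Python sketch. The events and the helper name `first_event_per_entity` are invented for illustration.

```python
import datetime as dt

# Hypothetical outcome events as (entity_id, timestamp). Entity 7 has two events.
events = [
    (7, dt.datetime(1967, 3, 1, 8, 0)),
    (3, dt.datetime(1966, 5, 9, 14, 30)),
    (7, dt.datetime(1965, 12, 24, 20, 15)),
]


def first_event_per_entity(rows):
    """Keep only the earliest timestamp for each entity id."""
    first = {}
    for entity_id, timestamp in rows:
        if entity_id not in first or timestamp < first[entity_id]:
            first[entity_id] = timestamp
    return first


first_outcomes = first_event_per_entity(events)
print(first_outcomes)
```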

We now have 4 dataframes loaded: df_prediction_times, df_synth_predictors, df_synth_sex and df_synth_outcome.

Specifying how to flatten the data#

We’ll have to specify how to flatten predictors and outcomes. To do this, we use the feature specification objects as “recipes” for each column in our finished dataframe. First, we’ll define the outcome specification.

Temporal outcome specification#

The main decision to make for outcomes is the size of the lookahead window. It determines how far into the future from a given prediction time to look for outcome values. A prediction time indicates at which point the model issues a prediction, and is used as a reference for the lookahead.

We want labels for prediction times to be 0 if the outcome never occurs, or if the outcome happens outside the lookahead window. Labels should only be 1 if the outcome occurs inside the lookahead window. Let’s specify this in code.

import datetime as dt

import pandas as pd
from timeseriesflattener import BooleanOutcomeSpec, TimestampValueFrame, ValueFrame
from timeseriesflattener.aggregators import MaxAggregator

test_df = pd.DataFrame({"entity_id": [0], "timestamp": [pd.Timestamp("2020-01-01")]})

outcome_spec = BooleanOutcomeSpec.from_primitives(
    df=test_df,
    entity_id_col_name="entity_id",
    value_timestamp_col_name="timestamp",
    lookahead_days=[365],
    aggregators=["max"],
    column_prefix="outc",
)

# Alternatively, if you prefer types
outcome_spec = BooleanOutcomeSpec(
    init_frame=TimestampValueFrame(
        entity_id_col_name="entity_id", init_df=test_df, value_timestamp_col_name="timestamp"
    ),
    lookahead_distances=[dt.timedelta(days=365)],
    aggregators=[MaxAggregator()],
    output_name="outcome",
    column_prefix="outc",
)
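
The labeling rule described above can also be written out in plain Python, independent of the library. This is only a sketch: `label_prediction_time` is a hypothetical helper, and the window boundaries (exclusive at the prediction time, inclusive at the end) are an assumption rather than the library's documented behaviour.

```python
import datetime as dt


def label_prediction_time(prediction_time, outcome_times, lookahead):
    """Return 1 if any outcome falls inside the lookahead window, else 0.

    Assumption: the window is (prediction_time, prediction_time + lookahead].
    """
    window_end = prediction_time + lookahead
    hit = any(prediction_time < t <= window_end for t in outcome_times)
    return int(hit)


prediction_time = dt.datetime(1967, 1, 1)
lookahead = dt.timedelta(days=365)

# Outcome inside the window, outside the window, and no outcome at all.
inside = label_prediction_time(prediction_time, [dt.datetime(1967, 6, 1)], lookahead)
outside = label_prediction_time(prediction_time, [dt.datetime(1969, 6, 1)], lookahead)
never = label_prediction_time(prediction_time, [], lookahead)
print(inside, outside, never)
```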

Since our outcome is binary, we want each prediction time to be labeled with 0 if no outcome is present within the lookahead window. This is what the fallback does: it is the default value used when no values are found within the lookahead. For BooleanOutcomeSpec, the fallback is hardcoded to 0.

Your use case determines how you want to handle multiple outcome values within lookahead days. In this case, we decide that any prediction time with at least one outcome (a timestamp in the loaded outcome data with a corresponding value of 1) within the specified lookahead days is “positive”. I.e., if there is both a 0 and a 1 within lookahead days, the prediction time should be labeled with a 1. We set aggregators = [MaxAggregator()] to accomplish this.
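
As a minimal illustration of that choice, here is what a max aggregation with a fallback amounts to. `aggregate_max` is a hypothetical stand-in written for this tutorial, not the library's MaxAggregator.

```python
def aggregate_max(values, fallback=0):
    """Return the largest value in the window, or the fallback if it is empty."""
    return max(values) if values else fallback


mixed = aggregate_max([0, 1, 0])  # both a 0 and a 1 inside the lookahead
empty = aggregate_max([])         # nothing found inside the lookahead
print(mixed, empty)
```

With a max, a single positive value is enough to make the prediction time positive. A min would instead require every value in the window to be 1.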

Here, we specify that we want to look 365 days forward from the prediction time to search for outcomes. If we want to require a certain period of time to pass after the prediction time before we start looking for outcome values, we can instead specify the lookahead as a tuple of (min_days, max_days).

Lastly, we specify a name for the outcome, which will be used when generating its column.

Temporal predictor specification#

Specifying a predictor is almost entirely identical to specifying an outcome. The only exception is that it looks a given number of days into the past from each prediction time instead of ahead.

import numpy as np
from timeseriesflattener import PredictorSpec, StaticSpec
from timeseriesflattener.aggregators import MeanAggregator

temporal_predictor_spec = PredictorSpec.from_primitives(
    df=df_synth_predictors.rename({"value": "value_1"}),
    entity_id_col_name="entity_id",
    value_timestamp_col_name="timestamp",
    aggregators=["mean"],
    column_prefix="pred",
    fallback=np.nan,
    lookbehind_days=[730],
)

# Alternatively, if you prefer types

temporal_predictor_spec = PredictorSpec(
    value_frame=ValueFrame(
        entity_id_col_name="entity_id",
        init_df=df_synth_predictors.rename({"value": "value_1"}),
        value_timestamp_col_name="timestamp",
    ),
    aggregators=[MeanAggregator()],
    column_prefix="pred",
    fallback=np.nan,
    lookbehind_distances=[dt.timedelta(days=730)],
)

Values within the lookbehind window are aggregated using aggregators, for example the mean as shown in this example, or max/min etc.
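
Conceptually, a mean over a lookbehind window with a NaN fallback works roughly like the sketch below. The helper `mean_within_lookbehind`, the sample observations, and the window boundaries are assumptions made for illustration; the library's own boundary handling may differ.

```python
import datetime as dt
import math
from statistics import fmean


def mean_within_lookbehind(prediction_time, observations, lookbehind):
    """Average the values observed in the lookbehind window.

    observations is a list of (timestamp, value) pairs.
    Assumption: the window is [prediction_time - lookbehind, prediction_time).
    """
    window_start = prediction_time - lookbehind
    values = [v for t, v in observations if window_start <= t < prediction_time]
    return fmean(values) if values else math.nan


observations = [
    (dt.datetime(1966, 3, 1), 2.0),
    (dt.datetime(1966, 9, 1), 4.0),
    (dt.datetime(1960, 1, 1), 100.0),  # older than the window, so ignored
]
lookbehind = dt.timedelta(days=730)

result = mean_within_lookbehind(dt.datetime(1967, 1, 1), observations, lookbehind)
no_data = mean_within_lookbehind(dt.datetime(1950, 1, 1), observations, lookbehind)
print(result, no_data)
```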

Note that we rename the value column to value_1. The value column’s name determines the name of the output column after aggregation. To avoid multiple output columns with the same name, all input value columns must have unique names.
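
The column names that appear in this tutorial's flattened output follow a regular pattern. The helper below reconstructs that pattern from those names; it is a hypothetical illustration, not the library's naming code, and other spec types (such as static specs) are named differently.

```python
def output_column_name(prefix, value_col, min_days, max_days, aggregator, fallback):
    """Rebuild a temporal column name in the pattern seen in this tutorial's output."""
    return (
        f"{prefix}_{value_col}_within_{min_days}_to_{max_days}"
        f"_days_{aggregator}_fallback_{fallback}"
    )


name_1 = output_column_name("pred", "value_1", 0, 730, "mean", "nan")
name_2 = output_column_name("pred", "value_2", 10, 365, "mean", "nan")
print(name_1)
print(name_2)
```

Because the value column's name is part of the result, two specs with identical windows and aggregators only stay distinguishable if their value columns are named differently.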

Temporal predictors can also be specified to look for values within a certain time range from the prediction time, similar to outcome specifications. For instance, you might want to create multiple predictors, where one looks for values within (0, 30) days, and another within (31, 182) days.

This can easily be specified by passing a (min_days, max_days) tuple to the lookbehind_days parameter.

temporal_interval_predictor_spec = PredictorSpec.from_primitives(
    df=df_synth_predictors.rename({"value": "value_2"}),
    entity_id_col_name="entity_id",
    value_timestamp_col_name="timestamp",
    aggregators=["mean"],
    column_prefix="pred",
    fallback=np.nan,
    lookbehind_days=[(10, 365)],
)

# Alternatively, if you prefer types
temporal_interval_predictor_spec = PredictorSpec(
    value_frame=ValueFrame(
        entity_id_col_name="entity_id",
        init_df=df_synth_predictors.rename({"value": "value_2"}),
        value_timestamp_col_name="timestamp",
    ),
    aggregators=[MeanAggregator()],
    column_prefix="pred",
    fallback=np.nan,
    lookbehind_distances=[(dt.timedelta(days=10), dt.timedelta(days=365))],
)
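
To make the interval semantics concrete, the sketch below selects the values whose age falls between 10 and 365 days before a prediction time. `values_in_interval` and the sample data are hypothetical, and treating both ends of the interval as inclusive is an assumption.

```python
import datetime as dt


def values_in_interval(prediction_time, observations, min_days, max_days):
    """Keep values whose age satisfies min_days <= age <= max_days (assumed inclusive)."""
    lower = dt.timedelta(days=min_days)
    upper = dt.timedelta(days=max_days)
    selected = []
    for timestamp, value in observations:
        age = prediction_time - timestamp
        if lower <= age <= upper:
            selected.append(value)
    return selected


prediction_time = dt.datetime(1968, 1, 1)
observations = [
    (dt.datetime(1967, 12, 28), 1.0),  # 4 days old: too recent
    (dt.datetime(1967, 10, 1), 2.0),   # 92 days old: kept
    (dt.datetime(1967, 3, 1), 3.0),    # 306 days old: kept
    (dt.datetime(1966, 6, 1), 4.0),    # 579 days old: too old
]

selected = values_in_interval(prediction_time, observations, 10, 365)
print(selected)
```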

Static predictor specification#

Static features should be specified using StaticSpec, as they are handled slightly differently. As in the previous specifications, we provide a dataframe containing the values. We also set a prefix. By default, PredictorSpec prefixes columns with “pred” and OutcomeSpec prefixes columns with “outc” to make filtering easy. As StaticSpec can be used for generating both predictors and outcomes, we manually set the prefix to “pred”, since sex is used as a predictor in this case.

from timeseriesflattener import StaticFrame

sex_predictor_spec = StaticSpec.from_primitives(
    df=df_synth_sex, entity_id_col_name="entity_id", column_prefix="pred", fallback=np.nan
)

# Alternatively, if you prefer types
sex_predictor_spec = StaticSpec(
    value_frame=StaticFrame(init_df=df_synth_sex), column_prefix="pred", fallback=np.nan
)

df_synth_sex
shape: (9_999, 2)
entity_id  female
i64        i64
0          0
1          1
2          1
3          1
4          0
…          …
9995       0
9996       0
9997       1
9998       1
9999       0

Note that we don’t need to specify which columns to aggregate. Timeseriesflattener aggregates all columns that are not entity_id_col_name or value_timestamp_col_name and uses the name(s) of the column(s) for the output.
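
For a static value, flattening amounts to attaching the entity's single value to each of its prediction times, with the fallback used when the entity has no value. The following plain-Python sketch uses invented data and a hypothetical helper `add_static_value` to show the idea.

```python
import math

# Hypothetical static values keyed by entity id.
female_by_entity = {0: 0, 1: 1, 2: 1}

# Hypothetical prediction times as (entity_id, timestamp string) pairs.
prediction_times = [
    (0, "1969-01-11 09:55"),
    (1, "1965-03-15 07:16"),
    (1, "1968-02-04 16:16"),
    (5, "1966-07-01 12:00"),  # entity 5 has no static value
]


def add_static_value(rows, static_values, fallback=math.nan):
    """Attach each entity's static value to every one of its prediction times."""
    return [(e, ts, static_values.get(e, fallback)) for e, ts in rows]


flattened = add_static_value(prediction_times, female_by_entity)
for row in flattened:
    print(row)
```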

Now we’re ready to flatten our dataset!

Flattening#

Flattening is as easy as instantiating the Flattener class with the prediction times dataframe, wrapped in a PredictionTimeFrame that names its entity id and timestamp columns, and calling aggregate_timeseries with the list of specs. n_workers can be set to parallelize operations across multiple cores.

from timeseriesflattener import Flattener, PredictionTimeFrame

flattener = Flattener(
    predictiontime_frame=PredictionTimeFrame(
        init_df=df_prediction_times, entity_id_col_name="entity_id", timestamp_col_name="timestamp"
    )
)
df = flattener.aggregate_timeseries(
    specs=[
        sex_predictor_spec,
        temporal_predictor_spec,
        temporal_interval_predictor_spec,
        outcome_spec,
    ]
).df

skim(df)

list(df.columns)
Processing spec: ['female']
Processing spec: ['value_1']
Processing spec: ['value_2']
Processing spec: ['outcome']

╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types                                                                 │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                          │
│ ┃ dataframe          Values ┃ ┃ Column Type  Count ┃                                                          │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                          │
│ │ Number of rows    │ 10000  │ │ int64       │ 3     │                                                          │
│ │ Number of columns │ 7      │ │ float64     │ 2     │                                                          │
│ └───────────────────┴────────┘ │ datetime64  │ 1     │                                                          │
│                                │ string      │ 1     │                                                          │
│                                └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━┓  │
│ ┃ column_name           ┃ NA   ┃ NA %  ┃ mean   ┃ sd    ┃ p0        ┃ p25   ┃ p50   ┃ p75   ┃ p100  ┃ hist   ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━┩  │
│ │ entity_id             │    0 │     0 │   4959 │  2886 │         0 │  2485 │  4922 │  7443 │  9999 │ ▇▇▇▇▇▇ │  │
│ │ pred_female_fallback_ │    0 │     0 │ 0.4931 │   0.5 │         0 │     0 │     0 │     1 │     1 │ ▇    ▇ │  │
│ │ nan                   │      │       │        │       │           │       │       │       │       │        │  │
│ │ pred_value_1_within_0 │ 1072 │ 10.72 │   5.01 │ 1.842 │   0.01491 │ 3.851 │ 5.023 │ 6.178 │ 9.946 │ ▁▃▇▇▃▁ │  │
│ │ _to_730_days_mean_fal │      │       │        │       │           │       │       │       │       │        │  │
│ │ lback_nan             │      │       │        │       │           │       │       │       │       │        │  │
│ │ pred_value_2_within_1 │ 2060 │  20.6 │  5.008 │ 2.222 │ 0.0003901 │  3.52 │ 5.014 │  6.56 │ 9.997 │ ▂▅▇▇▅▂ │  │
│ │ 0_to_365_days_mean_fa │      │       │        │       │           │       │       │       │       │        │  │
│ │ llback_nan            │      │       │        │       │           │       │       │       │       │        │  │
│ │ outc_outcome_within_0 │    0 │     0 │      0 │     0 │         0 │     0 │     0 │     0 │     0 │        │  │
│ │ _to_365_days_max_fall │      │       │        │       │           │       │       │       │       │        │  │
│ │ back_0                │      │       │        │       │           │       │       │       │       │        │  │
│ └───────────────────────┴──────┴───────┴────────┴───────┴───────────┴───────┴───────┴───────┴───────┴────────┘  │
│                                                    datetime                                                     │
│ ┏━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓  │
│ ┃ column_name       NA    NA %     first                       last                        frequency    ┃  │
│ ┡━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩  │
│ │ timestamp          0      0   1965-01-02 09:35:00       1969-12-31 21:42:00    None         │  │
│ └──────────────────┴──────┴─────────┴────────────────────────────┴────────────────────────────┴──────────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                            NA     NA %       words per row             total words         ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩  │
│ │ prediction_time_uuid                     0        0                       2              20000 │  │
│ └───────────────────────────────────────┴───────┴───────────┴──────────────────────────┴─────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯
['entity_id',
 'timestamp',
 'prediction_time_uuid',
 'pred_female_fallback_nan',
 'pred_value_1_within_0_to_730_days_mean_fallback_nan',
 'pred_value_2_within_10_to_365_days_mean_fallback_nan',
 'outc_outcome_within_0_to_365_days_max_fallback_0']
# For displayability, shorten col names
shortened_pred = "predX"
shortened_predinterval = "predX_30_to_90"
shortened_outcome = "outc_Y"

display_df = df.rename(
    {
        "pred_value_1_within_0_to_730_days_mean_fallback_nan": shortened_pred,
        "pred_value_2_within_10_to_365_days_mean_fallback_nan": shortened_predinterval,
        "outc_outcome_within_0_to_365_days_max_fallback_0": shortened_outcome,
    }
)
display_df
shape: (10_000, 7)
entity_id  timestamp            prediction_time_uuid              pred_female_fallback_nan  predX     predX_30_to_90  outc_Y
i64        datetime[μs]         str                               i64                       f64       f64             i32
9852       1965-01-02 09:35:00  "9852-1965-01-02 09:35:00.00000…  1                         NaN       NaN             0
1467       1965-01-02 10:05:00  "1467-1965-01-02 10:05:00.00000…  0                         NaN       NaN             0
1125       1965-01-02 12:55:00  "1125-1965-01-02 12:55:00.00000…  0                         NaN       NaN             0
649        1965-01-02 14:01:00  "649-1965-01-02 14:01:00.000000"  0                         NaN       NaN             0
2070       1965-01-03 08:01:00  "2070-1965-01-03 08:01:00.00000…  1                         NaN       NaN             0
…          …                    …                                 …                         …         …               …
334        1969-12-31 16:32:00  "334-1969-12-31 16:32:00.000000"  1                         7.252526  NaN             0
3363       1969-12-31 17:52:00  "3363-1969-12-31 17:52:00.00000…  1                         3.679667  2.409664        0
7929       1969-12-31 18:22:00  "7929-1969-12-31 18:22:00.00000…  0                         4.943585  7.475979        0
6002       1969-12-31 20:07:00  "6002-1969-12-31 20:07:00.00000…  1                         5.593583  7.076598        0
864        1969-12-31 21:42:00  "864-1969-12-31 21:42:00.000000"  1                         5.520416  NaN             0

And there we go! A dataframe ready for classification, containing:

  1. The entity IDs

  2. Timestamps for each prediction time

  3. A unique identifier for each prediction time

  4. Our predictor columns, prefixed with pred_ and

  5. Our outcome columns, prefixed with outc_
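
The prefixes make it straightforward to separate predictors from outcomes before training. The sketch below works on the column names listed in the output above, using plain Python only. Note that it matches on "pred_" including the underscore, so that prediction_time_uuid is not mistaken for a predictor.

```python
# Column names as listed in the flattened output above.
columns = [
    "entity_id",
    "timestamp",
    "prediction_time_uuid",
    "pred_female_fallback_nan",
    "pred_value_1_within_0_to_730_days_mean_fallback_nan",
    "pred_value_2_within_10_to_365_days_mean_fallback_nan",
    "outc_outcome_within_0_to_365_days_max_fallback_0",
]

# Include the trailing underscore so "prediction_time_uuid" is not matched.
predictor_cols = [c for c in columns if c.startswith("pred_")]
outcome_cols = [c for c in columns if c.startswith("outc_")]

print(predictor_cols)
print(outcome_cols)
```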