Adding text features

So far, the tutorials have dealt with tabular data only. This tutorial will show you how to make predictors out of text features, such as clinical notes, within timeseriesflattener.

Specifically, this tutorial will cover how to generate flattened predictors from already embedded text.

The dataset

To start out, let’s load a synthetic dataset containing text. As with all other features, each row in the dataset needs an ID, a timestamp, and the feature value.

from __future__ import annotations

from timeseriesflattener.testing.load_synth_data import load_synth_text
synth_text = load_synth_text()
synth_text.head()
shape: (5, 3)
entity_id  timestamp            value
i64        datetime[μs]         str
4647       1967-07-19 00:22:00  "The patient went into a medica…
2007       1966-11-25 02:02:00  "The patient is taken to the em…
5799       1967-09-19 12:31:00  "The patient, described as a 7-…
1319       1969-07-21 23:16:00  "The patient had been left on a…
4234       1966-04-14 22:04:00  "The patient had had some sever…

Generating predictors from embedded text

Because generating text embeddings can take a while, it can be advantageous to embed the text before using timeseriesflattener, so the embedding is computed only once if you’re generating multiple datasets. This section shows how to format a dataframe with embedded text for timeseriesflattener.

To start, let’s embed the synthetic text data using TF-IDF. You can use any form of text embedding you want. The only constraint is that the result of the embedding function should be a dataframe with an entity_id_col, a timestamp_col, and any number of columns containing the embeddings, where each cell holds a single value.
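As a rough illustration of that layout constraint, the sketch below expands a vector embedding into one scalar column per dimension, using plain dictionaries in place of dataframe rows. `toy_embed`, `to_wide_rows` and the `emb_{i}` column names are hypothetical; they only stand in for whatever embedding model and naming scheme you actually use.

```python
from __future__ import annotations

import datetime as dt


def toy_embed(text: str, n_dims: int = 3) -> list[float]:
    """Hypothetical stand-in for a real embedding model.

    Returns the fraction of words falling into each bucket,
    where the bucket is the word length modulo n_dims.
    """
    words = text.split()
    if not words:
        return [0.0] * n_dims
    counts = [0] * n_dims
    for word in words:
        counts[len(word) % n_dims] += 1
    return [count / len(words) for count in counts]


def to_wide_rows(
    records: list[tuple[int, dt.datetime, str]], n_dims: int = 3
) -> list[dict[str, object]]:
    """Build one row per note: ID, timestamp, then one scalar per embedding dimension."""
    rows = []
    for entity_id, timestamp, text in records:
        row: dict[str, object] = {"entity_id": entity_id, "timestamp": timestamp}
        for i, value in enumerate(toy_embed(text, n_dims)):
            # Each dimension gets its own column instead of one list-valued column.
            row[f"emb_{i}"] = value
        rows.append(row)
    return rows


records = [
    (1, dt.datetime(1967, 7, 19, 0, 22), "The patient went home"),
    (2, dt.datetime(1966, 11, 25, 2, 2), "The patient is taken to the ward"),
]
rows = to_wide_rows(records)
print(list(rows[0]))
```

A list of dictionaries like `rows` can be passed to `pl.DataFrame` to obtain a dataframe in this shape.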

For purposes of demonstration, we will fit a small TF-IDF model to the data and use it to embed the text.

%%capture
import polars as pl
from sklearn.feature_extraction.text import TfidfVectorizer


# define function to embed text and return a dataframe
def embed_text_to_df(text: list[str]) -> pl.DataFrame:
    tfidf_model = TfidfVectorizer(max_features=10)
    embeddings = tfidf_model.fit_transform(text)
    return pl.DataFrame(embeddings.toarray(), schema=tfidf_model.get_feature_names_out().tolist())


# embed text
embedded_text = embed_text_to_df(text=synth_text["value"].to_list())
# drop the text column from the original dataframe
metadata_only = synth_text.drop(["value"])
# concatenate the metadata and the embedded text
embedded_text_with_metadata = pl.concat([metadata_only, embedded_text], how="horizontal")
embedded_text_with_metadata.head()
shape: (5, 12)
entity_id  timestamp            and       for       in        of        or        patient   that      the       to        was
i64        datetime[μs]         f64       f64       f64       f64       f64       f64       f64       f64       f64       f64
4647       1967-07-19 00:22:00  0.175872  0.182066  0.249848  0.15843   0.0       0.023042  0.311389  0.529966  0.490203  0.479312
2007       1966-11-25 02:02:00  0.24487   0.0       0.135282  0.064337  0.465084  0.336859  0.151743  0.729861  0.179161  0.0
5799       1967-09-19 12:31:00  0.192367  0.232332  0.283402  0.336952  0.0       0.176422  0.238416  0.646879  0.250217  0.382277
1319       1969-07-21 23:16:00  0.165635  0.200046  0.183015  0.261115  0.125837  0.151906  0.205285  0.759528  0.403961  0.098747
4234       1966-04-14 22:04:00  0.493461  0.119196  0.272619  0.207444  0.0       0.045256  0.183475  0.588324  0.433253  0.235349

Now that we have our embeddings in a dataframe including the entity_id and timestamp, we can simply pass it to PredictorSpec!

import datetime as dt

import numpy as np
from timeseriesflattener import PredictorSpec, ValueFrame
from timeseriesflattener.aggregators import MeanAggregator

text_spec = PredictorSpec.from_primitives(
    df=embedded_text_with_metadata,
    entity_id_col_name="entity_id",
    value_timestamp_col_name="timestamp",
    lookbehind_days=[365, 730],
    aggregators=["mean"],
    column_prefix="pred_tfidf",
    fallback=np.nan,
)

# Alternatively, if you prefer types
text_spec = PredictorSpec(
    ValueFrame(
        init_df=embedded_text_with_metadata,
        entity_id_col_name="entity_id",
        value_timestamp_col_name="timestamp",
    ),
    lookbehind_distances=[dt.timedelta(days=365), dt.timedelta(days=730)],
    aggregators=[MeanAggregator()],
    fallback=np.nan,
    column_prefix="pred_tfidf",
)
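
To make the lookbehind and fallback arguments more concrete, here is a minimal pure-Python sketch of a mean aggregation over a lookbehind window for one entity and one value column. It is not the library's implementation: `lookbehind_mean` is a hypothetical helper, and treating both window edges as inclusive is an assumption made for the example.

```python
from __future__ import annotations

import datetime as dt
import math


def lookbehind_mean(
    prediction_time: dt.datetime,
    observations: list[tuple[dt.datetime, float]],
    lookbehind: dt.timedelta,
    fallback: float = math.nan,
) -> float:
    """Mean of the values observed within `lookbehind` before `prediction_time`.

    Assumption: both window edges are inclusive. Returns `fallback`
    when no observation falls inside the window.
    """
    window_start = prediction_time - lookbehind
    in_window = [
        value
        for timestamp, value in observations
        if window_start <= timestamp <= prediction_time
    ]
    if not in_window:
        return fallback
    return sum(in_window) / len(in_window)


# Values of a single embedding column for a single entity.
observations = [
    (dt.datetime(1966, 1, 1), 0.2),
    (dt.datetime(1967, 1, 1), 0.4),
    (dt.datetime(1967, 6, 1), 0.6),
]
prediction_time = dt.datetime(1967, 7, 1)

# The 365-day window only reaches the two most recent values.
mean_365 = lookbehind_mean(prediction_time, observations, dt.timedelta(days=365))
# The 730-day window reaches all three values.
mean_730 = lookbehind_mean(prediction_time, observations, dt.timedelta(days=730))
# A prediction time before any observation yields the fallback.
no_data = lookbehind_mean(dt.datetime(1965, 1, 1), observations, dt.timedelta(days=365))

print(mean_365, mean_730, no_data)
```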

Let’s make some features!

We are creating 10 * 2 = 20 features: one for each of the 10 embedding columns for each of the 2 lookbehinds (365 and 730 days), all using the mean aggregation function.
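
The arithmetic can be spelled out by enumerating the combinations. The naming pattern below is inferred from the output column names shown in this tutorial rather than taken from the library's source, so treat `feature_names` as an illustrative reconstruction.

```python
from itertools import product

value_columns = ["and", "for", "in", "of", "or", "patient", "that", "the", "to", "was"]
lookbehind_days = [365, 730]
aggregators = ["mean"]
column_prefix = "pred_tfidf"
fallback = float("nan")

# One feature per (lookbehind, aggregator, value column) combination.
# The name format is an assumption based on the tutorial's output columns.
feature_names = [
    f"{column_prefix}_{column}_within_0_to_{days}_days_{aggregator}_fallback_{fallback}"
    for days, aggregator, column in product(lookbehind_days, aggregators, value_columns)
]

# 10 columns * 2 lookbehinds * 1 aggregator
print(len(feature_names))
print(feature_names[0])
```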

# make features how you would normally
from timeseriesflattener import Flattener, PredictionTimeFrame
from timeseriesflattener.testing.load_synth_data import load_synth_prediction_times

flattener = Flattener(
    predictiontime_frame=PredictionTimeFrame(
        init_df=load_synth_prediction_times(),
        entity_id_col_name="entity_id",
        timestamp_col_name="timestamp",
    )
)

df = flattener.aggregate_timeseries(specs=[text_spec]).df
Processing spec: ['and', 'for', 'in', 'of', 'or', 'patient', 'that', 'the', 'to', 'was']

Let’s check the output.

import polars as pl
import polars.selectors as cs

# drop rows with NaN in any float column (no embeddings within the lookbehind period) for the sake of this example
df.filter(pl.all_horizontal(cs.float().is_not_nan())).head()
shape: (5, 23)
(one line per column: name, dtype, then the values of the five rows)
entity_id (i64): 6840, 2039, 9496, 7281, 7424
timestamp (datetime[μs]): 1965-11-02 07:17:00, 1966-04-20 05:06:00, 1966-12-06 06:44:00, 1967-06-05 00:41:00, 1967-07-13 15:01:00
prediction_time_uuid (str): "6840-1965-11-02 07:17:00.00000…, "2039-1966-04-20 05:06:00.00000…, "9496-1966-12-06 06:44:00.00000…, "7281-1967-06-05 00:41:00.00000…, "7424-1967-07-13 15:01:00.00000…
pred_tfidf_and_within_0_to_365_days_mean_fallback_nan (f64): 0.155821, 0.108015, 0.279955, 0.289663, 0.153907
pred_tfidf_for_within_0_to_365_days_mean_fallback_nan (f64): 0.376386, 0.0, 0.0, 0.04373, 0.092941
pred_tfidf_in_within_0_to_365_days_mean_fallback_nan (f64): 0.258256, 0.596744, 0.30933, 0.280049, 0.170056
pred_tfidf_of_within_0_to_365_days_mean_fallback_nan (f64): 0.573168, 0.11352, 0.294222, 0.304425, 0.107834
pred_tfidf_or_within_0_to_365_days_mean_fallback_nan (f64): 0.355142, 0.0, 0.0, 0.385111, 0.389756
pred_tfidf_patient_within_0_to_365_days_mean_fallback_nan (f64): 0.071452, 0.099062, 0.256749, 0.332065, 0.282299
pred_tfidf_that_within_0_to_365_days_mean_fallback_nan (f64): 0.096561, 0.133872, 0.0, 0.269251, 0.063583
pred_tfidf_the_within_0_to_365_days_mean_fallback_nan (f64): 0.28581, 0.693431, 0.513498, 0.464891, 0.682222
pred_tfidf_to_within_0_to_365_days_mean_fallback_nan (f64): 0.45603, 0.210747, 0.546216, 0.211934, 0.475452
pred_tfidf_was_within_0_to_365_days_mean_fallback_nan (f64): 0.092896, 0.257581, 0.3338, 0.388547, 0.0
pred_tfidf_and_within_0_to_730_days_mean_fallback_nan (f64): 0.155821, 0.108015, 0.279955, 0.289663, 0.153907
pred_tfidf_for_within_0_to_730_days_mean_fallback_nan (f64): 0.376386, 0.0, 0.0, 0.04373, 0.092941
pred_tfidf_in_within_0_to_730_days_mean_fallback_nan (f64): 0.258256, 0.596744, 0.30933, 0.280049, 0.170056
pred_tfidf_of_within_0_to_730_days_mean_fallback_nan (f64): 0.573168, 0.11352, 0.294222, 0.304425, 0.107834
pred_tfidf_or_within_0_to_730_days_mean_fallback_nan (f64): 0.355142, 0.0, 0.0, 0.385111, 0.389756
pred_tfidf_patient_within_0_to_730_days_mean_fallback_nan (f64): 0.071452, 0.099062, 0.256749, 0.332065, 0.282299
pred_tfidf_that_within_0_to_730_days_mean_fallback_nan (f64): 0.096561, 0.133872, 0.0, 0.269251, 0.063583
pred_tfidf_the_within_0_to_730_days_mean_fallback_nan (f64): 0.28581, 0.693431, 0.513498, 0.464891, 0.682222
pred_tfidf_to_within_0_to_730_days_mean_fallback_nan (f64): 0.45603, 0.210747, 0.546216, 0.211934, 0.475452
pred_tfidf_was_within_0_to_730_days_mean_fallback_nan (f64): 0.092896, 0.257581, 0.3338, 0.388547, 0.0

And just like that, you’re ready to make a prediction model!