Adding text features#

So far, the tutorials have dealt with tabular data only. This tutorial will show you how to make predictors out of text features, such as clinical notes, within timeseriesflattener.

Specifically, this tutorial will cover how to generate flattened predictors from already embedded text.

The dataset#

To start out, let’s load a synthetic dataset containing text. As with all other features, each row in the dataset needs an ID, a timestamp, and the feature value.

from __future__ import annotations

from timeseriesflattener.testing.load_synth_data import load_synth_text
synth_text = load_synth_text()
synth_text.head()
shape: (5, 3)
┌───────────┬─────────────────────┬─────────────────────────────────┐
│ entity_id ┆ timestamp           ┆ value                           │
│ ---       ┆ ---                 ┆ ---                             │
│ i64       ┆ datetime[μs]        ┆ str                             │
╞═══════════╪═════════════════════╪═════════════════════════════════╡
│ 4647      ┆ 1967-07-19 00:22:00 ┆ "The patient went into a medica… │
│ 2007      ┆ 1966-11-25 02:02:00 ┆ "The patient is taken to the em… │
│ 5799      ┆ 1967-09-19 12:31:00 ┆ "The patient, described as a 7-… │
│ 1319      ┆ 1969-07-21 23:16:00 ┆ "The patient had been left on a… │
│ 4234      ┆ 1966-04-14 22:04:00 ┆ "The patient had had some sever… │
└───────────┴─────────────────────┴─────────────────────────────────┘
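
If you're bringing your own data, all you need is a dataframe with the same three columns. A minimal sketch (the IDs, timestamps, and notes below are made up purely for illustration):

import datetime as dt

import polars as pl

# hypothetical example data: one row per note, with an entity ID,
# a timestamp, and the raw text as the value
my_text = pl.DataFrame(
    {
        "entity_id": [1, 1, 2],
        "timestamp": [
            dt.datetime(2022, 1, 1, 8, 0),
            dt.datetime(2022, 3, 2, 14, 30),
            dt.datetime(2022, 2, 1, 9, 15),
        ],
        "value": [
            "First clinical note for patient 1.",
            "Follow-up note for patient 1.",
            "Admission note for patient 2.",
        ],
    }
)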

Generating predictors from embedded text#

As generating text embeddings can take a while, it can be advantageous to embed the text before passing it to timeseriesflattener, especially if you're generating multiple datasets from the same embeddings. This first block shows how to format a dataframe with embedded text for timeseriesflattener.

To start, let’s embed the synthetic text data using TF-IDF. You can use any form of text embedding you want - the only constraint is that the resulting dataframe should contain an entity ID column, a timestamp column, and any number of columns containing the embeddings, with a single value in each column.

For demonstration purposes, we will fit a small TF-IDF model to the data and use it to embed the text.

%%capture
import polars as pl
from sklearn.feature_extraction.text import TfidfVectorizer


# define function to embed text and return a dataframe
def embed_text_to_df(text: list[str]) -> pl.DataFrame:
    tfidf_model = TfidfVectorizer(max_features=10)
    embeddings = tfidf_model.fit_transform(text)
    return pl.DataFrame(embeddings.toarray(), schema=tfidf_model.get_feature_names_out().tolist())


# embed text
embedded_text = embed_text_to_df(text=synth_text["value"].to_list())
# drop the text column from the original dataframe
metadata_only = synth_text.drop("value")
# concatenate the metadata and the embedded text
embedded_text_with_metadata = pl.concat([metadata_only, embedded_text], how="horizontal")
embedded_text_with_metadata.head()
shape: (5, 12)
┌───────────┬─────────────────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ entity_id ┆ timestamp           ┆ and      ┆ for      ┆ in       ┆ of       ┆ or       ┆ patient  ┆ that     ┆ the      ┆ to       ┆ was      │
│ ---       ┆ ---                 ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
│ i64       ┆ datetime[μs]        ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ f64      │
╞═══════════╪═════════════════════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╡
│ 4647      ┆ 1967-07-19 00:22:00 ┆ 0.175872 ┆ 0.182066 ┆ 0.249848 ┆ 0.15843  ┆ 0.0      ┆ 0.023042 ┆ 0.311389 ┆ 0.529966 ┆ 0.490203 ┆ 0.479312 │
│ 2007      ┆ 1966-11-25 02:02:00 ┆ 0.24487  ┆ 0.0      ┆ 0.135282 ┆ 0.064337 ┆ 0.465084 ┆ 0.336859 ┆ 0.151743 ┆ 0.729861 ┆ 0.179161 ┆ 0.0      │
│ 5799      ┆ 1967-09-19 12:31:00 ┆ 0.192367 ┆ 0.232332 ┆ 0.283402 ┆ 0.336952 ┆ 0.0      ┆ 0.176422 ┆ 0.238416 ┆ 0.646879 ┆ 0.250217 ┆ 0.382277 │
│ 1319      ┆ 1969-07-21 23:16:00 ┆ 0.165635 ┆ 0.200046 ┆ 0.183015 ┆ 0.261115 ┆ 0.125837 ┆ 0.151906 ┆ 0.205285 ┆ 0.759528 ┆ 0.403961 ┆ 0.098747 │
│ 4234      ┆ 1966-04-14 22:04:00 ┆ 0.493461 ┆ 0.119196 ┆ 0.272619 ┆ 0.207444 ┆ 0.0      ┆ 0.045256 ┆ 0.183475 ┆ 0.588324 ┆ 0.433253 ┆ 0.235349 │
└───────────┴─────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
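
TF-IDF is only one option. As a sketch of how you might swap in a pre-trained transformer model instead (this assumes the optional sentence-transformers package is installed; the model name is just an example):

import polars as pl
from sentence_transformers import SentenceTransformer


def embed_text_with_transformer(text: list[str]) -> pl.DataFrame:
    # encode returns a (n_texts, embedding_dim) numpy array
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(text)
    return pl.DataFrame(
        embeddings, schema=[f"emb_{i}" for i in range(embeddings.shape[1])]
    )

The resulting dataframe can be concatenated with the metadata columns exactly as above.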

Now that we have our embeddings in a dataframe including the entity_id and timestamp, we can simply pass it to PredictorSpec!

import datetime as dt

import numpy as np
from timeseriesflattener import PredictorSpec, ValueFrame
from timeseriesflattener.aggregators import MeanAggregator

text_spec = PredictorSpec(
    ValueFrame(
        init_df=embedded_text_with_metadata,
        entity_id_col_name="entity_id",
        value_timestamp_col_name="timestamp",
    ),
    lookbehind_distances=[dt.timedelta(days=365), dt.timedelta(days=730)],
    aggregators=[MeanAggregator()],
    fallback=np.nan,
    column_prefix="pred_tfidf",
)
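
PredictorSpec also accepts multiple aggregators, in which case the feature count multiplies accordingly. A sketch, assuming MaxAggregator is exported by your version of timeseriesflattener.aggregators (this spec is not used further in the tutorial):

from timeseriesflattener.aggregators import MaxAggregator

# 10 embeddings * 2 lookbehinds * 2 aggregators = 40 features
text_spec_multi = PredictorSpec(
    ValueFrame(
        init_df=embedded_text_with_metadata,
        entity_id_col_name="entity_id",
        value_timestamp_col_name="timestamp",
    ),
    lookbehind_distances=[dt.timedelta(days=365), dt.timedelta(days=730)],
    aggregators=[MeanAggregator(), MaxAggregator()],
    fallback=np.nan,
    column_prefix="pred_tfidf",
)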

Let’s make some features!

We are creating 10 * 2 = 20 features: one for each of the 10 embedding columns, for each lookbehind window (365 and 730 days), aggregated with the mean.

# make features how you would normally
from timeseriesflattener import Flattener, PredictionTimeFrame
from timeseriesflattener.testing.load_synth_data import load_synth_prediction_times

flattener = Flattener(
    predictiontime_frame=PredictionTimeFrame(
        init_df=load_synth_prediction_times(),
        entity_id_col_name="entity_id",
        timestamp_col_name="timestamp",
    )
)

df = flattener.aggregate_timeseries(specs=[text_spec]).df
Processing spec: ['and', 'for', 'in', 'of', 'or', 'patient', 'that', 'the', 'to', 'was']
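
As a quick optional sanity check, the output should contain entity_id, timestamp, and prediction_time_uuid plus the 10 * 2 = 20 feature columns:

# 3 metadata columns + 10 embeddings * 2 lookbehinds = 23 columns
assert df.shape[1] == 23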


Let’s check the output.

import polars as pl
import polars.selectors as cs

# drop rows with NaN in any float column (prediction times with no embeddings within the lookbehind period) for the sake of this example
df.filter(pl.all_horizontal(cs.float().is_not_nan())).head()
shape: (5, 23)
┌───────────┬─────────────────────┬──────────────────────────────────┬─────────────────────────────────┬───┬─────────────────────────────────┐
│ entity_id ┆ timestamp           ┆ prediction_time_uuid             ┆ pred_tfidf_and_within_0_to_365… ┆ … ┆ pred_tfidf_was_within_0_to_730… │
│ ---       ┆ ---                 ┆ ---                              ┆ ---                             ┆   ┆ ---                             │
│ i64       ┆ datetime[μs]        ┆ str                              ┆ f64                             ┆   ┆ f64                             │
╞═══════════╪═════════════════════╪══════════════════════════════════╪═════════════════════════════════╪═══╪═════════════════════════════════╡
│ 6840      ┆ 1965-11-02 07:17:00 ┆ "6840-1965-11-02 07:17:00.00000… ┆ 0.155821                        ┆ … ┆ 0.092896                        │
│ 2039      ┆ 1966-04-20 05:06:00 ┆ "2039-1966-04-20 05:06:00.00000… ┆ 0.108015                        ┆ … ┆ 0.257581                        │
│ 9496      ┆ 1966-12-06 06:44:00 ┆ "9496-1966-12-06 06:44:00.00000… ┆ 0.279955                        ┆ … ┆ 0.3338                          │
│ 7281      ┆ 1967-06-05 00:41:00 ┆ "7281-1967-06-05 00:41:00.00000… ┆ 0.289663                        ┆ … ┆ 0.388547                        │
│ 7424      ┆ 1967-07-13 15:01:00 ┆ "7424-1967-07-13 15:01:00.00000… ┆ 0.153907                        ┆ … ┆ 0.0                             │
└───────────┴─────────────────────┴──────────────────────────────────┴─────────────────────────────────┴───┴─────────────────────────────────┘

The view above is truncated for display: the full output contains one pred_tfidf_<token>_within_0_to_<lookbehind>_days_mean_fallback_nan column per TF-IDF token and lookbehind.

And just like that, you’re ready to make a prediction model!
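
As a closing sketch (not part of the tutorial itself), here is how the flattened features could feed a toy scikit-learn classifier. The outcome labels below are random noise purely to make the example runnable; in practice you would join your real outcome on prediction_time_uuid:

import numpy as np
import polars as pl
import polars.selectors as cs
from sklearn.linear_model import LogisticRegression

# keep only rows where all features are defined, as above
model_df = df.filter(pl.all_horizontal(cs.float().is_not_nan()))
X = model_df.select(cs.float()).to_numpy()
y = np.random.default_rng(42).integers(0, 2, size=len(model_df))  # fake labels

clf = LogisticRegression().fit(X, y)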