Adding text features#
So far, the tutorials have dealt with tabular data only. This tutorial will show you how to make predictors out of text features, such as clinical notes, with timeseriesflattener.
Specifically, it covers how to generate flattened predictors from already-embedded text.
The dataset#
To start out, let’s load a synthetic dataset containing text. As with all other features, each row in the dataset needs an ID, a timestamp, and the feature value.
from __future__ import annotations
from timeseriesflattener.testing.load_synth_data import load_synth_text
synth_text = load_synth_text()
synth_text.head()
| entity_id | timestamp | value |
|---|---|---|
| i64 | datetime[μs] | str |
| 4647 | 1967-07-19 00:22:00 | "The patient went into a medica… |
| 2007 | 1966-11-25 02:02:00 | "The patient is taken to the em… |
| 5799 | 1967-09-19 12:31:00 | "The patient, described as a 7-… |
| 1319 | 1969-07-21 23:16:00 | "The patient had been left on a… |
| 4234 | 1966-04-14 22:04:00 | "The patient had had some sever… |
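If you are bringing your own clinical notes, any dataframe with these same three columns will work. As a minimal sketch, you could construct one directly; the IDs, timestamps, and note texts below are purely illustrative:
import datetime as dt
import polars as pl
# hypothetical example data: two notes for one patient, one note for another
my_notes = pl.DataFrame(
    {
        "entity_id": [1, 1, 2],
        "timestamp": [
            dt.datetime(2020, 1, 1, 8, 0),
            dt.datetime(2020, 3, 15, 12, 30),
            dt.datetime(2021, 6, 2, 9, 45),
        ],
        "value": [
            "Patient admitted with chest pain.",
            "Follow-up visit, symptoms have resolved.",
            "Routine check-up, no complaints.",
        ],
    }
)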
Generating predictors from embedded text#
As generating text embeddings can often take a while, it can be advantageous to embed the text before using timeseriesflattener, to speed up computation if you're generating multiple datasets. This first block will show you how to format a dataframe with embedded text for timeseriesflattener.
To start, let's embed the synthetic text data using TF-IDF. You can use any form of text embedding you want; the only constraint is that the result of the embedding function should be a dataframe with an entity_id_col, a timestamp_col, and any number of columns containing the embeddings, with a single value in each column.
For purposes of demonstration, we will fit a small TF-IDF model to the data and use it to embed the text.
%%capture
import polars as pl
from sklearn.feature_extraction.text import TfidfVectorizer
# define function to embed text and return a dataframe
def embed_text_to_df(text: list[str]) -> pl.DataFrame:
    tfidf_model = TfidfVectorizer(max_features=10)
    embeddings = tfidf_model.fit_transform(text)
    return pl.DataFrame(embeddings.toarray(), schema=tfidf_model.get_feature_names_out().tolist())
# embed text
embedded_text = embed_text_to_df(text=synth_text["value"].to_list())
# drop the text column from the original dataframe
metadata_only = synth_text.drop(["value"])
# concatenate the metadata and the embedded text
embedded_text_with_metadata = pl.concat([metadata_only, embedded_text], how="horizontal")
embedded_text_with_metadata.head()
| entity_id | timestamp | and | for | in | of | or | patient | that | the | to | was |
|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | datetime[μs] | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| 4647 | 1967-07-19 00:22:00 | 0.175872 | 0.182066 | 0.249848 | 0.15843 | 0.0 | 0.023042 | 0.311389 | 0.529966 | 0.490203 | 0.479312 |
| 2007 | 1966-11-25 02:02:00 | 0.24487 | 0.0 | 0.135282 | 0.064337 | 0.465084 | 0.336859 | 0.151743 | 0.729861 | 0.179161 | 0.0 |
| 5799 | 1967-09-19 12:31:00 | 0.192367 | 0.232332 | 0.283402 | 0.336952 | 0.0 | 0.176422 | 0.238416 | 0.646879 | 0.250217 | 0.382277 |
| 1319 | 1969-07-21 23:16:00 | 0.165635 | 0.200046 | 0.183015 | 0.261115 | 0.125837 | 0.151906 | 0.205285 | 0.759528 | 0.403961 | 0.098747 |
| 4234 | 1966-04-14 22:04:00 | 0.493461 | 0.119196 | 0.272619 | 0.207444 | 0.0 | 0.045256 | 0.183475 | 0.588324 | 0.433253 | 0.235349 |
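Since the main benefit of pre-embedding is that the result can be reused across datasets, you may want to persist the embedded dataframe and load it again in later runs. A minimal sketch, assuming you want a local Parquet file (the path is just an example):
# write the embedded dataframe to disk once...
embedded_text_with_metadata.write_parquet("embedded_text.parquet")
# ...and read it back whenever you need to generate another dataset
embedded_text_with_metadata = pl.read_parquet("embedded_text.parquet")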
Now that we have our embeddings in a dataframe including the entity_id and timestamp, we can simply pass it to PredictorSpec!
import datetime as dt
import numpy as np
from timeseriesflattener import PredictorSpec, ValueFrame
from timeseriesflattener.aggregators import MeanAggregator
text_spec = PredictorSpec.from_primitives(
    df=embedded_text_with_metadata,
    entity_id_col_name="entity_id",
    value_timestamp_col_name="timestamp",
    lookbehind_days=[365, 730],
    aggregators=["mean"],
    column_prefix="pred_tfidf",
    fallback=np.nan,
)
# Alternatively, if you prefer types
text_spec = PredictorSpec(
    ValueFrame(
        init_df=embedded_text_with_metadata,
        entity_id_col_name="entity_id",
        value_timestamp_col_name="timestamp",
    ),
    lookbehind_distances=[dt.timedelta(days=365), dt.timedelta(days=730)],
    aggregators=[MeanAggregator()],
    fallback=np.nan,
    column_prefix="pred_tfidf",
)
Let’s make some features!
We are creating 10 * 2 = 20 features: one for each of the 10 embedding columns for each of the two lookbehinds (365 and 730 days), using the mean aggregation function.
# make features as you normally would
from timeseriesflattener import Flattener, PredictionTimeFrame
from timeseriesflattener.testing.load_synth_data import load_synth_prediction_times
flattener = Flattener(
    predictiontime_frame=PredictionTimeFrame(
        init_df=load_synth_prediction_times(),
        entity_id_col_name="entity_id",
        timestamp_col_name="timestamp",
    )
)
df = flattener.aggregate_timeseries(specs=[text_spec]).df
Processing spec: ['and', 'for', 'in', 'of', 'or', 'patient', 'that', 'the', 'to', 'was']
Let’s check the output.
import polars as pl
import polars.selectors as cs
# drop rows with NaN values in the float columns (prediction times with no embeddings within the lookbehind periods) for the sake of this example
df.filter(pl.all_horizontal(cs.float().is_not_nan())).head()
| entity_id | timestamp | prediction_time_uuid | pred_tfidf_and_within_0_to_365_days_mean_fallback_nan | pred_tfidf_for_within_0_to_365_days_mean_fallback_nan | pred_tfidf_in_within_0_to_365_days_mean_fallback_nan | pred_tfidf_of_within_0_to_365_days_mean_fallback_nan | pred_tfidf_or_within_0_to_365_days_mean_fallback_nan | pred_tfidf_patient_within_0_to_365_days_mean_fallback_nan | pred_tfidf_that_within_0_to_365_days_mean_fallback_nan | pred_tfidf_the_within_0_to_365_days_mean_fallback_nan | pred_tfidf_to_within_0_to_365_days_mean_fallback_nan | pred_tfidf_was_within_0_to_365_days_mean_fallback_nan | pred_tfidf_and_within_0_to_730_days_mean_fallback_nan | pred_tfidf_for_within_0_to_730_days_mean_fallback_nan | pred_tfidf_in_within_0_to_730_days_mean_fallback_nan | pred_tfidf_of_within_0_to_730_days_mean_fallback_nan | pred_tfidf_or_within_0_to_730_days_mean_fallback_nan | pred_tfidf_patient_within_0_to_730_days_mean_fallback_nan | pred_tfidf_that_within_0_to_730_days_mean_fallback_nan | pred_tfidf_the_within_0_to_730_days_mean_fallback_nan | pred_tfidf_to_within_0_to_730_days_mean_fallback_nan | pred_tfidf_was_within_0_to_730_days_mean_fallback_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | datetime[μs] | str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| 6840 | 1965-11-02 07:17:00 | "6840-1965-11-02 07:17:00.00000… | 0.155821 | 0.376386 | 0.258256 | 0.573168 | 0.355142 | 0.071452 | 0.096561 | 0.28581 | 0.45603 | 0.092896 | 0.155821 | 0.376386 | 0.258256 | 0.573168 | 0.355142 | 0.071452 | 0.096561 | 0.28581 | 0.45603 | 0.092896 |
| 2039 | 1966-04-20 05:06:00 | "2039-1966-04-20 05:06:00.00000… | 0.108015 | 0.0 | 0.596744 | 0.11352 | 0.0 | 0.099062 | 0.133872 | 0.693431 | 0.210747 | 0.257581 | 0.108015 | 0.0 | 0.596744 | 0.11352 | 0.0 | 0.099062 | 0.133872 | 0.693431 | 0.210747 | 0.257581 |
| 9496 | 1966-12-06 06:44:00 | "9496-1966-12-06 06:44:00.00000… | 0.279955 | 0.0 | 0.30933 | 0.294222 | 0.0 | 0.256749 | 0.0 | 0.513498 | 0.546216 | 0.3338 | 0.279955 | 0.0 | 0.30933 | 0.294222 | 0.0 | 0.256749 | 0.0 | 0.513498 | 0.546216 | 0.3338 |
| 7281 | 1967-06-05 00:41:00 | "7281-1967-06-05 00:41:00.00000… | 0.289663 | 0.04373 | 0.280049 | 0.304425 | 0.385111 | 0.332065 | 0.269251 | 0.464891 | 0.211934 | 0.388547 | 0.289663 | 0.04373 | 0.280049 | 0.304425 | 0.385111 | 0.332065 | 0.269251 | 0.464891 | 0.211934 | 0.388547 |
| 7424 | 1967-07-13 15:01:00 | "7424-1967-07-13 15:01:00.00000… | 0.153907 | 0.092941 | 0.170056 | 0.107834 | 0.389756 | 0.282299 | 0.063583 | 0.682222 | 0.475452 | 0.0 | 0.153907 | 0.092941 | 0.170056 | 0.107834 | 0.389756 | 0.282299 | 0.063583 | 0.682222 | 0.475452 | 0.0 |
And just like that, you’re ready to make a prediction model!
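If you want to go straight from this dataframe to model training, note that all generated predictors share the pred_ column prefix, so they are easy to select. A minimal sketch (the outcome/label column is not created here; you would add it with an outcome spec, as covered in the other tutorials):
import polars.selectors as cs
# select only the generated predictor columns as the feature matrix
feature_matrix = df.select(cs.starts_with("pred_"))
# a label column would come from an outcome spec and is not part of this dataframe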