Adding text features#
So far, the tutorials have dealt with tabular data only. This tutorial will show you how to make predictors out of text features, such as clinical notes, within timeseriesflattener.
Specifically, this tutorial will cover how to generate flattened predictors from already embedded text.
The dataset#
To start out, let’s load a synthetic dataset containing text. As with all other features, each row in the dataset needs an ID, a timestamp, and the feature value.
from __future__ import annotations
from timeseriesflattener.testing.load_synth_data import load_synth_text
synth_text = load_synth_text()
synth_text.head()
entity_id | timestamp | value |
---|---|---|
i64 | datetime[μs] | str |
4647 | 1967-07-19 00:22:00 | "The patient went into a medica… |
2007 | 1966-11-25 02:02:00 | "The patient is taken to the em… |
5799 | 1967-09-19 12:31:00 | "The patient, described as a 7-… |
1319 | 1969-07-21 23:16:00 | "The patient had been left on a… |
4234 | 1966-04-14 22:04:00 | "The patient had had some sever… |
Generating predictors from embedded text#
As generating text embeddings can often take a while, it can be advantageous to embed the text before using timeseriesflattener, to speed up the computation if you're generating multiple datasets. This first block will show you how to format a dataframe with embedded text for timeseriesflattener.
To start, let’s embed the synthetic text data using TF-IDF. You can use any form of text embedding you want - the only constraint is that the result of the embedding function should be a dataframe with an entity_id_col, a timestamp_col, and any number of columns containing the embeddings, with a single value in each column.
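To make that format concrete, here is a minimal, hand-constructed sketch of such a dataframe. It is not part of the tutorial data, and the embedding column names (emb_0, emb_1) are arbitrary placeholders.
import datetime as dt
import polars as pl
# illustrative only: two embedding dimensions, one float per row per dimension
expected_format = pl.DataFrame(
    {
        "entity_id": [1, 1, 2],
        "timestamp": [
            dt.datetime(1967, 1, 1),
            dt.datetime(1967, 6, 1),
            dt.datetime(1968, 1, 1),
        ],
        "emb_0": [0.12, 0.48, 0.33],
        "emb_1": [0.91, 0.07, 0.55],
    }
)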
For purposes of demonstration, we will fit a small TF-IDF model to the data and use it to embed the text.
%%capture
import polars as pl
from sklearn.feature_extraction.text import TfidfVectorizer
# define function to embed text and return a dataframe
def embed_text_to_df(text: list[str]) -> pl.DataFrame:
    tfidf_model = TfidfVectorizer(max_features=10)
    embeddings = tfidf_model.fit_transform(text)
    return pl.DataFrame(embeddings.toarray(), schema=tfidf_model.get_feature_names_out().tolist())
# embed text
embedded_text = embed_text_to_df(text=synth_text["value"].to_list())
# drop the text column from the original dataframe
metadata_only = synth_text.drop(["value"])
# concatenate the metadata and the embedded text
embedded_text_with_metadata = pl.concat([metadata_only, embedded_text], how="horizontal")
embedded_text_with_metadata.head()
entity_id | timestamp | and | for | in | of | or | patient | that | the | to | was |
---|---|---|---|---|---|---|---|---|---|---|---|
i64 | datetime[μs] | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
4647 | 1967-07-19 00:22:00 | 0.175872 | 0.182066 | 0.249848 | 0.15843 | 0.0 | 0.023042 | 0.311389 | 0.529966 | 0.490203 | 0.479312 |
2007 | 1966-11-25 02:02:00 | 0.24487 | 0.0 | 0.135282 | 0.064337 | 0.465084 | 0.336859 | 0.151743 | 0.729861 | 0.179161 | 0.0 |
5799 | 1967-09-19 12:31:00 | 0.192367 | 0.232332 | 0.283402 | 0.336952 | 0.0 | 0.176422 | 0.238416 | 0.646879 | 0.250217 | 0.382277 |
1319 | 1969-07-21 23:16:00 | 0.165635 | 0.200046 | 0.183015 | 0.261115 | 0.125837 | 0.151906 | 0.205285 | 0.759528 | 0.403961 | 0.098747 |
4234 | 1966-04-14 22:04:00 | 0.493461 | 0.119196 | 0.272619 | 0.207444 | 0.0 | 0.045256 | 0.183475 | 0.588324 | 0.433253 | 0.235349 |
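Since the main benefit of pre-computing embeddings is being able to reuse them across datasets, you may want to persist the embedded dataframe to disk once and load it whenever you flatten. A minimal sketch using polars' parquet I/O; the file path is just an example.
# persist the pre-computed embeddings so they can be reused across datasets
embedded_text_with_metadata.write_parquet("embedded_text.parquet")
# later, when building a dataset, load the embeddings instead of recomputing them
embedded_text_with_metadata = pl.read_parquet("embedded_text.parquet")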
Now that we have our embeddings in a dataframe including the entity_id and timestamp, we can simply pass it to PredictorSpec!
import datetime as dt
import numpy as np
from timeseriesflattener import PredictorSpec, ValueFrame
from timeseriesflattener.aggregators import MeanAggregator
text_spec = PredictorSpec.from_primitives(
    df=embedded_text_with_metadata,
    entity_id_col_name="entity_id",
    value_timestamp_col_name="timestamp",
    lookbehind_days=[365, 730],
    aggregators=["mean"],
    column_prefix="pred_tfidf",
    fallback=np.nan,
)
# Alternatively, if you prefer types
text_spec = PredictorSpec(
    ValueFrame(
        init_df=embedded_text_with_metadata,
        entity_id_col_name="entity_id",
        value_timestamp_col_name="timestamp",
    ),
    lookbehind_distances=[dt.timedelta(days=365), dt.timedelta(days=730)],
    aggregators=[MeanAggregator()],
    fallback=np.nan,
    column_prefix="pred_tfidf",
)
Let’s make some features!
We are creating 10 * 2 = 20 features: one for each embedding dimension for each lookbehind (365 and 730 days), using the mean aggregation function.
# make features how you would normally
from timeseriesflattener import Flattener, PredictionTimeFrame
from timeseriesflattener.testing.load_synth_data import load_synth_prediction_times
flattener = Flattener(
    predictiontime_frame=PredictionTimeFrame(
        init_df=load_synth_prediction_times(),
        entity_id_col_name="entity_id",
        timestamp_col_name="timestamp",
    )
)
df = flattener.aggregate_timeseries(specs=[text_spec]).df
Processing spec: ['and', 'for', 'in', 'of', 'or', 'patient', 'that', 'the', 'to', 'was']
Let’s check the output.
import polars as pl
import polars.selectors as cs
# drop rows with NaN values in the float columns (no embeddings within the lookbehind periods) for the sake of this example
df.filter(pl.all_horizontal(cs.float().is_not_nan())).head()
entity_id | timestamp | prediction_time_uuid | pred_tfidf_and_within_0_to_365_days_mean_fallback_nan | pred_tfidf_for_within_0_to_365_days_mean_fallback_nan | pred_tfidf_in_within_0_to_365_days_mean_fallback_nan | pred_tfidf_of_within_0_to_365_days_mean_fallback_nan | pred_tfidf_or_within_0_to_365_days_mean_fallback_nan | pred_tfidf_patient_within_0_to_365_days_mean_fallback_nan | pred_tfidf_that_within_0_to_365_days_mean_fallback_nan | pred_tfidf_the_within_0_to_365_days_mean_fallback_nan | pred_tfidf_to_within_0_to_365_days_mean_fallback_nan | pred_tfidf_was_within_0_to_365_days_mean_fallback_nan | pred_tfidf_and_within_0_to_730_days_mean_fallback_nan | pred_tfidf_for_within_0_to_730_days_mean_fallback_nan | pred_tfidf_in_within_0_to_730_days_mean_fallback_nan | pred_tfidf_of_within_0_to_730_days_mean_fallback_nan | pred_tfidf_or_within_0_to_730_days_mean_fallback_nan | pred_tfidf_patient_within_0_to_730_days_mean_fallback_nan | pred_tfidf_that_within_0_to_730_days_mean_fallback_nan | pred_tfidf_the_within_0_to_730_days_mean_fallback_nan | pred_tfidf_to_within_0_to_730_days_mean_fallback_nan | pred_tfidf_was_within_0_to_730_days_mean_fallback_nan |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | datetime[μs] | str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
6840 | 1965-11-02 07:17:00 | "6840-1965-11-02 07:17:00.00000… | 0.155821 | 0.376386 | 0.258256 | 0.573168 | 0.355142 | 0.071452 | 0.096561 | 0.28581 | 0.45603 | 0.092896 | 0.155821 | 0.376386 | 0.258256 | 0.573168 | 0.355142 | 0.071452 | 0.096561 | 0.28581 | 0.45603 | 0.092896 |
2039 | 1966-04-20 05:06:00 | "2039-1966-04-20 05:06:00.00000… | 0.108015 | 0.0 | 0.596744 | 0.11352 | 0.0 | 0.099062 | 0.133872 | 0.693431 | 0.210747 | 0.257581 | 0.108015 | 0.0 | 0.596744 | 0.11352 | 0.0 | 0.099062 | 0.133872 | 0.693431 | 0.210747 | 0.257581 |
9496 | 1966-12-06 06:44:00 | "9496-1966-12-06 06:44:00.00000… | 0.279955 | 0.0 | 0.30933 | 0.294222 | 0.0 | 0.256749 | 0.0 | 0.513498 | 0.546216 | 0.3338 | 0.279955 | 0.0 | 0.30933 | 0.294222 | 0.0 | 0.256749 | 0.0 | 0.513498 | 0.546216 | 0.3338 |
7281 | 1967-06-05 00:41:00 | "7281-1967-06-05 00:41:00.00000… | 0.289663 | 0.04373 | 0.280049 | 0.304425 | 0.385111 | 0.332065 | 0.269251 | 0.464891 | 0.211934 | 0.388547 | 0.289663 | 0.04373 | 0.280049 | 0.304425 | 0.385111 | 0.332065 | 0.269251 | 0.464891 | 0.211934 | 0.388547 |
7424 | 1967-07-13 15:01:00 | "7424-1967-07-13 15:01:00.00000… | 0.153907 | 0.092941 | 0.170056 | 0.107834 | 0.389756 | 0.282299 | 0.063583 | 0.682222 | 0.475452 | 0.0 | 0.153907 | 0.092941 | 0.170056 | 0.107834 | 0.389756 | 0.282299 | 0.063583 | 0.682222 | 0.475452 | 0.0 |
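If you want to sanity-check that all 10 * 2 = 20 features were created, you can count the generated columns. This sketch assumes the pred_tfidf column prefix used above.
# count the flattened text features (10 embedding dimensions x 2 lookbehinds = 20)
tfidf_cols = [col for col in df.columns if col.startswith("pred_tfidf")]
print(len(tfidf_cols))  # 20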
And just like that, you’re ready to make a prediction model!