# Adding text features
So far, the tutorials have dealt with tabular data only. This tutorial will show you how to make predictors out of text features, such as clinical notes, within timeseriesflattener.
Specifically, this tutorial will cover how to generate flattened predictors from already embedded text.
## The dataset
To start out, let’s load a synthetic dataset containing text. As with all other features, each row in the dataset needs an ID, a timestamp, and the feature value.
from __future__ import annotations
from timeseriesflattener.testing.load_synth_data import load_synth_text
synth_text = load_synth_text()
synth_text.head()
entity_id (i64) | timestamp (datetime[μs]) | value (str) |
---|---|---|
4647 | 1967-07-19 00:22:00 | "The patient went into a medica… |
2007 | 1966-11-25 02:02:00 | "The patient is taken to the em… |
5799 | 1967-09-19 12:31:00 | "The patient, described as a 7-… |
1319 | 1969-07-21 23:16:00 | "The patient had been left on a… |
4234 | 1966-04-14 22:04:00 | "The patient had had some sever… |
## Generating predictors from embedded text
As generating text embeddings can often take a while, it can be advantageous to embed the text before passing it to timeseriesflattener, especially if you're generating multiple datasets from the same embeddings. This first block will show you how to format a dataframe with embedded text for timeseriesflattener.
To start, let’s embed the synthetic text data using TF-IDF. You can use any form of text embedding you want - the only constraint is that the result of the embedding function should be a dataframe with an entity_id_col, a timestamp_col, and any number of columns containing the embeddings, with a single value in each column.
For purposes of demonstration, we will fit a small TF-IDF model to the data and use it to embed the text.
%%capture
import polars as pl
from sklearn.feature_extraction.text import TfidfVectorizer
# define a function that embeds a list of texts and returns a dataframe
def embed_text_to_df(text: list[str]) -> pl.DataFrame:
    # fit a small TF-IDF model and transform the text
    tfidf_model = TfidfVectorizer(max_features=10)
    embeddings = tfidf_model.fit_transform(text)
    # one column per term in the TF-IDF vocabulary
    return pl.DataFrame(
        embeddings.toarray(), schema=tfidf_model.get_feature_names_out().tolist()
    )
# embed text
embedded_text = embed_text_to_df(text=synth_text["value"].to_list())
# drop the text column from the original dataframe
# (polars' drop takes column names positionally)
metadata_only = synth_text.drop("value")
# concatenate the metadata and the embedded text
embedded_text_with_metadata = pl.concat([metadata_only, embedded_text], how="horizontal")
embedded_text_with_metadata.head()
entity_id (i64) | timestamp (datetime[μs]) | and (f64) | for (f64) | in (f64) | of (f64) | or (f64) | patient (f64) | that (f64) | the (f64) | to (f64) | was (f64) |
---|---|---|---|---|---|---|---|---|---|---|---|
4647 | 1967-07-19 00:22:00 | 0.175872 | 0.182066 | 0.249848 | 0.15843 | 0.0 | 0.023042 | 0.311389 | 0.529966 | 0.490203 | 0.479312 |
2007 | 1966-11-25 02:02:00 | 0.24487 | 0.0 | 0.135282 | 0.064337 | 0.465084 | 0.336859 | 0.151743 | 0.729861 | 0.179161 | 0.0 |
5799 | 1967-09-19 12:31:00 | 0.192367 | 0.232332 | 0.283402 | 0.336952 | 0.0 | 0.176422 | 0.238416 | 0.646879 | 0.250217 | 0.382277 |
1319 | 1969-07-21 23:16:00 | 0.165635 | 0.200046 | 0.183015 | 0.261115 | 0.125837 | 0.151906 | 0.205285 | 0.759528 | 0.403961 | 0.098747 |
4234 | 1966-04-14 22:04:00 | 0.493461 | 0.119196 | 0.272619 | 0.207444 | 0.0 | 0.045256 | 0.183475 | 0.588324 | 0.433253 | 0.235349 |
Now that we have our embeddings in a dataframe including the entity_id and timestamp columns, we can simply pass it to PredictorSpec!
import datetime as dt
import numpy as np
from timeseriesflattener import PredictorSpec, ValueFrame
from timeseriesflattener.aggregators import MeanAggregator
text_spec = PredictorSpec(
    ValueFrame(
        init_df=embedded_text_with_metadata,
        entity_id_col_name="entity_id",
        value_timestamp_col_name="timestamp",
    ),
    lookbehind_distances=[dt.timedelta(days=365), dt.timedelta(days=730)],
    aggregators=[MeanAggregator()],
    fallback=np.nan,
    column_prefix="pred_tfidf",
)
Let’s make some features!
We are creating 10 * 2 = 20 features: one for each embedding dimension for each lookbehind (365 and 730 days), using the mean aggregation function.
# make features how you would normally
from timeseriesflattener import Flattener, PredictionTimeFrame
from timeseriesflattener.testing.load_synth_data import load_synth_prediction_times
flattener = Flattener(
    predictiontime_frame=PredictionTimeFrame(
        init_df=load_synth_prediction_times(),
        entity_id_col_name="entity_id",
        timestamp_col_name="timestamp",
    )
)
df = flattener.aggregate_timeseries(specs=[text_spec]).df
Processing spec: ['and', 'for', 'in', 'of', 'or', 'patient', 'that', 'the', 'to', 'was']
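The output column names follow a predictable pattern: the column prefix, the embedding column, the lookbehind window, the aggregator, and the fallback. A small sketch that enumerates the names we expect from the spec above (assuming this naming scheme, which matches the printed output):

```python
# the TF-IDF vocabulary, i.e. the embedding columns from earlier
embedding_cols = ["and", "for", "in", "of", "or", "patient", "that", "the", "to", "was"]
lookbehind_days = [365, 730]

# {prefix}_{embedding_col}_within_{start}_to_{end}_days_{aggregator}_fallback_{fallback}
expected_cols = [
    f"pred_tfidf_{col}_within_0_to_{days}_days_mean_fallback_nan"
    for days in lookbehind_days
    for col in embedding_cols
]
len(expected_cols)  # 10 embeddings * 2 lookbehinds = 20 features
```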
Let’s check the output.
import polars as pl
import polars.selectors as cs
# for this example, drop rows that are NaN in the float columns
# (i.e. prediction times with no embedded notes within the lookbehind periods)
df.filter(pl.all_horizontal(cs.float().is_not_nan())).head()
entity_id (i64) | timestamp (datetime[μs]) | prediction_time_uuid (str) | pred_tfidf_and_within_0_to_365_days_mean_fallback_nan (f64) | pred_tfidf_for_within_0_to_365_days_mean_fallback_nan (f64) | pred_tfidf_in_within_0_to_365_days_mean_fallback_nan (f64) | pred_tfidf_of_within_0_to_365_days_mean_fallback_nan (f64) | pred_tfidf_or_within_0_to_365_days_mean_fallback_nan (f64) | pred_tfidf_patient_within_0_to_365_days_mean_fallback_nan (f64) | pred_tfidf_that_within_0_to_365_days_mean_fallback_nan (f64) | pred_tfidf_the_within_0_to_365_days_mean_fallback_nan (f64) | pred_tfidf_to_within_0_to_365_days_mean_fallback_nan (f64) | pred_tfidf_was_within_0_to_365_days_mean_fallback_nan (f64) | pred_tfidf_and_within_0_to_730_days_mean_fallback_nan (f64) | pred_tfidf_for_within_0_to_730_days_mean_fallback_nan (f64) | pred_tfidf_in_within_0_to_730_days_mean_fallback_nan (f64) | pred_tfidf_of_within_0_to_730_days_mean_fallback_nan (f64) | pred_tfidf_or_within_0_to_730_days_mean_fallback_nan (f64) | pred_tfidf_patient_within_0_to_730_days_mean_fallback_nan (f64) | pred_tfidf_that_within_0_to_730_days_mean_fallback_nan (f64) | pred_tfidf_the_within_0_to_730_days_mean_fallback_nan (f64) | pred_tfidf_to_within_0_to_730_days_mean_fallback_nan (f64) | pred_tfidf_was_within_0_to_730_days_mean_fallback_nan (f64) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6840 | 1965-11-02 07:17:00 | "6840-1965-11-02 07:17:00.00000… | 0.155821 | 0.376386 | 0.258256 | 0.573168 | 0.355142 | 0.071452 | 0.096561 | 0.28581 | 0.45603 | 0.092896 | 0.155821 | 0.376386 | 0.258256 | 0.573168 | 0.355142 | 0.071452 | 0.096561 | 0.28581 | 0.45603 | 0.092896 |
2039 | 1966-04-20 05:06:00 | "2039-1966-04-20 05:06:00.00000… | 0.108015 | 0.0 | 0.596744 | 0.11352 | 0.0 | 0.099062 | 0.133872 | 0.693431 | 0.210747 | 0.257581 | 0.108015 | 0.0 | 0.596744 | 0.11352 | 0.0 | 0.099062 | 0.133872 | 0.693431 | 0.210747 | 0.257581 |
9496 | 1966-12-06 06:44:00 | "9496-1966-12-06 06:44:00.00000… | 0.279955 | 0.0 | 0.30933 | 0.294222 | 0.0 | 0.256749 | 0.0 | 0.513498 | 0.546216 | 0.3338 | 0.279955 | 0.0 | 0.30933 | 0.294222 | 0.0 | 0.256749 | 0.0 | 0.513498 | 0.546216 | 0.3338 |
7281 | 1967-06-05 00:41:00 | "7281-1967-06-05 00:41:00.00000… | 0.289663 | 0.04373 | 0.280049 | 0.304425 | 0.385111 | 0.332065 | 0.269251 | 0.464891 | 0.211934 | 0.388547 | 0.289663 | 0.04373 | 0.280049 | 0.304425 | 0.385111 | 0.332065 | 0.269251 | 0.464891 | 0.211934 | 0.388547 |
7424 | 1967-07-13 15:01:00 | "7424-1967-07-13 15:01:00.00000… | 0.153907 | 0.092941 | 0.170056 | 0.107834 | 0.389756 | 0.282299 | 0.063583 | 0.682222 | 0.475452 | 0.0 | 0.153907 | 0.092941 | 0.170056 | 0.107834 | 0.389756 | 0.282299 | 0.063583 | 0.682222 | 0.475452 | 0.0 |
And just like that, you’re ready to make a prediction model!