Adding text features#

So far, the tutorials have dealt with tabular data only. This tutorial will show you how to make predictors out of text features, such as clinical notes, within timeseriesflattener.

Specifically, this tutorial will cover how to generate flattened predictors from already embedded text.

The dataset#

To start out, let’s load a synthetic dataset containing text. As with all other features, each row in the dataset needs an ID, a timestamp, and the feature value.

from __future__ import annotations

from timeseriesflattener.testing.load_synth_data import load_synth_text
synth_text = load_synth_text()
synth_text.head()
shape: (5, 3)
┌───────────┬─────────────────────┬─────────────────────────────────┐
│ entity_id ┆ timestamp           ┆ value                           │
│ ---       ┆ ---                 ┆ ---                             │
│ i64       ┆ datetime[μs]        ┆ str                             │
╞═══════════╪═════════════════════╪═════════════════════════════════╡
│ 4647      ┆ 1967-07-19 00:22:00 ┆ "The patient went into a medica… │
│ 2007      ┆ 1966-11-25 02:02:00 ┆ "The patient is taken to the em… │
│ 5799      ┆ 1967-09-19 12:31:00 ┆ "The patient, described as a 7-… │
│ 1319      ┆ 1969-07-21 23:16:00 ┆ "The patient had been left on a… │
│ 4234      ┆ 1966-04-14 22:04:00 ┆ "The patient had had some sever… │
└───────────┴─────────────────────┴─────────────────────────────────┘
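
If you're bringing your own data, all you need is a dataframe with the same three columns. A minimal sketch (the IDs, timestamps, and notes below are made up purely for illustration):

import datetime as dt

import polars as pl

# hypothetical example data: one row per note, with an entity ID,
# a timestamp, and the raw text as the value
my_text = pl.DataFrame(
    {
        "entity_id": [1, 1, 2],
        "timestamp": [
            dt.datetime(2022, 1, 1, 8, 0),
            dt.datetime(2022, 3, 2, 14, 30),
            dt.datetime(2022, 2, 1, 9, 15),
        ],
        "value": [
            "First clinical note for patient 1.",
            "Follow-up note for patient 1.",
            "Admission note for patient 2.",
        ],
    }
)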

Generating predictors from embedded text#

As generating text embeddings can take a while, it can be advantageous to embed the text before passing it to timeseriesflattener, especially if you're generating multiple datasets from the same embeddings. This first block shows how to format a dataframe with embedded text for timeseriesflattener.

To start, let’s embed the synthetic text data using TF-IDF. You can use any form of text embedding you want - the only constraint is that the resulting dataframe should contain an entity ID column, a timestamp column, and any number of columns containing the embeddings, with a single value in each column.

For demonstration purposes, we will fit a small TF-IDF model to the data and use it to embed the text.

%%capture
import polars as pl
from sklearn.feature_extraction.text import TfidfVectorizer


# define function to embed text and return a dataframe
def embed_text_to_df(text: list[str]) -> pl.DataFrame:
    tfidf_model = TfidfVectorizer(max_features=10)
    embeddings = tfidf_model.fit_transform(text)
    return pl.DataFrame(embeddings.toarray(), schema=tfidf_model.get_feature_names_out().tolist())


# embed text
embedded_text = embed_text_to_df(text=synth_text["value"].to_list())
# drop the text column from the original dataframe
metadata_only = synth_text.drop("value")
# concatenate the metadata and the embedded text
embedded_text_with_metadata = pl.concat([metadata_only, embedded_text], how="horizontal")
embedded_text_with_metadata.head()
shape: (5, 12)
┌───────────┬─────────────────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ entity_id ┆ timestamp           ┆ and      ┆ for      ┆ in       ┆ of       ┆ or       ┆ patient  ┆ that     ┆ the      ┆ to       ┆ was      │
│ ---       ┆ ---                 ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
│ i64       ┆ datetime[μs]        ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ f64      │
╞═══════════╪═════════════════════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╡
│ 4647      ┆ 1967-07-19 00:22:00 ┆ 0.175872 ┆ 0.182066 ┆ 0.249848 ┆ 0.15843  ┆ 0.0      ┆ 0.023042 ┆ 0.311389 ┆ 0.529966 ┆ 0.490203 ┆ 0.479312 │
│ 2007      ┆ 1966-11-25 02:02:00 ┆ 0.24487  ┆ 0.0      ┆ 0.135282 ┆ 0.064337 ┆ 0.465084 ┆ 0.336859 ┆ 0.151743 ┆ 0.729861 ┆ 0.179161 ┆ 0.0      │
│ 5799      ┆ 1967-09-19 12:31:00 ┆ 0.192367 ┆ 0.232332 ┆ 0.283402 ┆ 0.336952 ┆ 0.0      ┆ 0.176422 ┆ 0.238416 ┆ 0.646879 ┆ 0.250217 ┆ 0.382277 │
│ 1319      ┆ 1969-07-21 23:16:00 ┆ 0.165635 ┆ 0.200046 ┆ 0.183015 ┆ 0.261115 ┆ 0.125837 ┆ 0.151906 ┆ 0.205285 ┆ 0.759528 ┆ 0.403961 ┆ 0.098747 │
│ 4234      ┆ 1966-04-14 22:04:00 ┆ 0.493461 ┆ 0.119196 ┆ 0.272619 ┆ 0.207444 ┆ 0.0      ┆ 0.045256 ┆ 0.183475 ┆ 0.588324 ┆ 0.433253 ┆ 0.235349 │
└───────────┴─────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
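
TF-IDF is only one option. As a sketch of how you might swap in a pre-trained transformer model instead (this assumes the optional sentence-transformers package is installed; the model name is just an example):

import polars as pl
from sentence_transformers import SentenceTransformer


def embed_text_with_transformer(text: list[str]) -> pl.DataFrame:
    # encode returns a (n_texts, embedding_dim) numpy array
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(text)
    return pl.DataFrame(
        embeddings, schema=[f"emb_{i}" for i in range(embeddings.shape[1])]
    )

The resulting dataframe can be concatenated with the metadata columns exactly as above.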

Now that we have our embeddings in a dataframe including the entity_id and timestamp, we can simply pass it to PredictorSpec!

import datetime as dt

import numpy as np
from timeseriesflattener import PredictorSpec, ValueFrame
from timeseriesflattener.aggregators import MeanAggregator

text_spec = PredictorSpec(
    ValueFrame(
        init_df=embedded_text_with_metadata,
        entity_id_col_name="entity_id",
        value_timestamp_col_name="timestamp",
    ),
    lookbehind_distances=[dt.timedelta(days=365), dt.timedelta(days=730)],
    aggregators=[MeanAggregator()],
    fallback=np.nan,
    column_prefix="pred_tfidf",
)
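
PredictorSpec also accepts multiple aggregators, in which case the feature count multiplies accordingly. A sketch, assuming MaxAggregator is exported by your version of timeseriesflattener.aggregators (this spec is not used further in the tutorial):

from timeseriesflattener.aggregators import MaxAggregator

# 10 embeddings * 2 lookbehinds * 2 aggregators = 40 features
text_spec_multi = PredictorSpec(
    ValueFrame(
        init_df=embedded_text_with_metadata,
        entity_id_col_name="entity_id",
        value_timestamp_col_name="timestamp",
    ),
    lookbehind_distances=[dt.timedelta(days=365), dt.timedelta(days=730)],
    aggregators=[MeanAggregator(), MaxAggregator()],
    fallback=np.nan,
    column_prefix="pred_tfidf",
)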

Let’s make some features!

We are creating 10 * 2 = 20 features: one for each of the 10 embedding columns, for each lookbehind window (365 and 730 days), aggregated with the mean.

# make features how you would normally
from timeseriesflattener import Flattener, PredictionTimeFrame
from timeseriesflattener.testing.load_synth_data import load_synth_prediction_times

flattener = Flattener(
    predictiontime_frame=PredictionTimeFrame(
        init_df=load_synth_prediction_times(),
        entity_id_col_name="entity_id",
        timestamp_col_name="timestamp",
    )
)

df = flattener.aggregate_timeseries(specs=[text_spec]).df
Processing spec: ['and', 'for', 'in', 'of', 'or', 'patient', 'that', 'the', 'to', 'was']
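
As a quick optional sanity check, the output should contain entity_id, timestamp, and prediction_time_uuid plus the 10 * 2 = 20 feature columns:

# 3 metadata columns + 10 embeddings * 2 lookbehinds = 23 columns
assert df.shape[1] == 23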


Let’s check the output.

import polars as pl
import polars.selectors as cs

# drop rows with NaN in any float column (prediction times with no embeddings within the lookbehind period) for the sake of this example
df.filter(pl.all_horizontal(cs.float().is_not_nan())).head()
shape: (5, 23)
┌───────────┬─────────────────────┬──────────────────────────────────┬─────────────────────────────────┬───┬─────────────────────────────────┐
│ entity_id ┆ timestamp           ┆ prediction_time_uuid             ┆ pred_tfidf_and_within_0_to_365… ┆ … ┆ pred_tfidf_was_within_0_to_730… │
│ ---       ┆ ---                 ┆ ---                              ┆ ---                             ┆   ┆ ---                             │
│ i64       ┆ datetime[μs]        ┆ str                              ┆ f64                             ┆   ┆ f64                             │
╞═══════════╪═════════════════════╪══════════════════════════════════╪═════════════════════════════════╪═══╪═════════════════════════════════╡
│ 6840      ┆ 1965-11-02 07:17:00 ┆ "6840-1965-11-02 07:17:00.00000… ┆ 0.155821                        ┆ … ┆ 0.092896                        │
│ 2039      ┆ 1966-04-20 05:06:00 ┆ "2039-1966-04-20 05:06:00.00000… ┆ 0.108015                        ┆ … ┆ 0.257581                        │
│ 9496      ┆ 1966-12-06 06:44:00 ┆ "9496-1966-12-06 06:44:00.00000… ┆ 0.279955                        ┆ … ┆ 0.3338                          │
│ 7281      ┆ 1967-06-05 00:41:00 ┆ "7281-1967-06-05 00:41:00.00000… ┆ 0.289663                        ┆ … ┆ 0.388547                        │
│ 7424      ┆ 1967-07-13 15:01:00 ┆ "7424-1967-07-13 15:01:00.00000… ┆ 0.153907                        ┆ … ┆ 0.0                             │
└───────────┴─────────────────────┴──────────────────────────────────┴─────────────────────────────────┴───┴─────────────────────────────────┘

The view above is truncated for display: the full output contains one pred_tfidf_<token>_within_0_to_<lookbehind>_days_mean_fallback_nan column per TF-IDF token and lookbehind.

And just like that, you’re ready to make a prediction model!
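
As a closing sketch (not part of the tutorial itself), here is how the flattened features could feed a toy scikit-learn classifier. The outcome labels below are random noise purely to make the example runnable; in practice you would join your real outcome on prediction_time_uuid:

import numpy as np
import polars as pl
import polars.selectors as cs
from sklearn.linear_model import LogisticRegression

# keep only rows where all features are defined, as above
model_df = df.filter(pl.all_horizontal(cs.float().is_not_nan()))
X = model_df.select(cs.float()).to_numpy()
y = np.random.default_rng(42).integers(0, 2, size=len(model_df))  # fake labels

clf = LogisticRegression().fit(X, y)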