Text Serialization and Their Relationship with the Conventional Paradigms of Tabular Machine Learning (2024)

Kyoka Ono  Simon A. Lee

Abstract

Recent research has explored how Language Models (LMs) can be used for feature representation and prediction in tabular machine learning tasks. This involves employing text serialization and supervised fine-tuning (SFT) techniques. Despite the simplicity of these techniques, significant gaps remain in our understanding of the applicability and reliability of LMs in this context. Our study assesses how emerging LM technologies compare with traditional paradigms in tabular machine learning and evaluates the feasibility of adopting similar approaches with these advanced technologies. At the data level, we investigate various methods of representing and curating serialized tabular data, exploring their impact on prediction performance. At the classification level, we examine whether text serialization combined with LMs enhances performance on tabular datasets with challenging characteristics (e.g., class imbalance, distribution shift, biases, and high dimensionality), and assess whether this method represents a state-of-the-art (SOTA) approach for addressing tabular machine learning challenges. Our findings reveal that current pre-trained models should not replace conventional approaches.


1 Introduction

In the field of natural language processing (NLP), a paradigm shift has occurred, driven by the emergence of Language Model (LM) technologies rooted in the transformer architecture (Vaswani et al., 2017). These advancements have led to immense progress across various domains of machine learning (ML) and artificial intelligence (AI). Leveraging sophisticated techniques such as transfer learning (Weiss et al., 2016) and attention mechanisms (Bahdanau et al., 2014), LMs have demonstrated exceptional capabilities in tasks encompassing language understanding (Devlin et al., 2018), translation (Lewis et al., 2019), and text generation (Radford et al., 2018), thereby significantly influencing applications within the field of NLP. However, researchers from various fields have discovered that these LMs are not limited to conventional tasks. Consequently, there has been a surge of research into other areas and domains, such as question answering (Radford et al., 2019; Su et al., 2019) and mathematical reasoning (Trinh et al., 2024; Wang et al., 2023; Imani et al., 2023), among others.

Therefore, in this paper, we focus on the ability of LMs to solve tabular machine learning tasks as introduced by (Hegselmann et al., 2023; Sahakyan et al., 2021; Dinh et al., 2022; Fang et al., 2024). These studies utilize text serialization—converting tabular data into natural language representations—combined with supervised fine-tuning (SFT) to evaluate LMs’ capability on supervised machine learning tasks. Yet, current papers do not explore whether this process or these LMs could represent a state-of-the-art (SOTA) approach in machine learning. This oversight is especially significant in light of previous assertions that gradient boosting methods outperform deep learning strategies (Grinsztajn et al., 2022).

These previous works also did not determine whether various data curation measures are required to obtain accurate results, or how to handle the data preparation practices commonly used in tabular machine learning (e.g., missing data, feature scaling). As a result, there are open questions in the current literature about text serialization and whether it aligns with conventional machine learning paradigms.

In this work, we explore the unresolved questions related to text serialization. We believe this research is crucial for contrasting the differences between traditional ML methods and emerging methodologies like “text serialization” developed for LM technologies. Thus, we rigorously analyze numerous publicly available tabular datasets and detail the various experiments conducted to gain insights into the current questions in this area of research. We aim to determine whether data curation is necessary and assess whether these pre-trained LMs should be used over traditional tabular solvers like gradient boosting under varying dataset characteristics. The contributions of this paper are as follows:

  • We investigate whether open-source LMs, in conjunction with text serialization, can achieve state-of-the-art (SOTA) performance compared to current ML methods in supervised learning tasks. We aim to determine whether pre-trained models should be preferred over previously established gradient-boosted methods.

  • We investigate how various data curation strategies for text serialization, such as addressing missing values, feature importance, and feature scaling, affect prediction performance. We also consider whether these common protocols should be followed for language modeling.

  • We investigate the adaptability and generalization capabilities of LMs across different characteristics of tabular datasets that are commonly encountered in real-world data (e.g. high dimensionality, imbalance).

  • We evaluate the robustness of LM-based models against common distribution shifts and dataset biases, examining how their pretrained parameters respond to these characteristics.

2 Related Works

2.1 Text Serialization

Text serialization, introduced by (Hegselmann et al., 2023; Dinh et al., 2022; GIDROL, ; Jaitly et al., 2023; Lee et al., 2024a), created an interface for easily integrating tabular data with LMs by converting tabular data fields into natural language representations. Since its emergence, numerous papers across various applications, including healthcare, have adopted a similar approach (Chen et al., 2024a; Kim et al., 2024; Hegselmann et al., 2024; Belyaeva et al., 2023). Lee et al. found that text serialization proved particularly effective for handling categorical tabular data with a large number of classes, observing that a natural language representation outperformed engineered features like one-hot encoding (Lee et al., 2024b). Text serialization has also found application in various reasoning tasks, such as feature extraction, enabling systems to extract information from tables or databases to answer queries, as seen in Question and Answer (Q&A) scenarios (Min et al., 2024; Sui et al., 2024; Li et al., 2024).

Following this conversion from tabular to text, the resulting data can be directly input into foundation models (e.g., BERT (Devlin et al., 2018), GPT (Brown et al., 2020)) to obtain rich feature representations in the form of high-fidelity vectors. Recent research has focused extensively on representing numerical data (Gorishniy et al., 2022; Golkar et al., 2023), where these foundation models have demonstrated competitive and often superior performance compared to current models like XGBoost (Chen & Guestrin, 2016) and LGBM (Ke et al., 2017), providing recent evidence against previous claims that boosted methods are the SOTA (Grinsztajn et al., 2022).

2.2 Tabular Deep Learning

Deep learning has emerged as an exceptional computational framework across numerous disciplines due to its ability to learn complex patterns in large datasets (Zhang et al., 2018; Feng et al., 2019), generalize effectively (Sanh et al., 2021), apply transfer learning techniques (Torrey & Shavlik, 2010; Zhuang et al., 2020; Pan & Yang, 2009; Niu et al., 2020; Levin et al., 2022), and scale with powerful hardware (Mayer & Jacobsen, 2020; Chilimbi et al., 2014; Rouhani et al., 2018). Tabular deep learning has been investigated over the years, yet there remains no consensus on whether it represents the optimal modeling approach for this type of data (Shwartz-Ziv & Armon, 2022; Borisov et al., 2022; Gorishniy et al., 2021). Despite this lack of consensus, many groups continue to explore this field extensively. Examples include TabNet (Arik & Pfister, 2021), TabPFN (Hollmann et al., 2022), SAINT (Somepalli et al., 2021), TabTransformer (Huang et al., 2020), NODE (Popov et al., 2019), and TaBERT (Yin et al., 2020). Kadra et al. demonstrated that even simple neural nets can produce high-performing models compared to baselines (Kadra et al., 2021).

[Figure 1]

More recently, there has been a resurgence of interest in tabular deep learning, driven by advancements in Language Model (LM) technology. Notably, models like TabLLM (Hegselmann et al., 2023), LIFT (Dinh et al., 2022), MEME (Lee et al., 2024b, c), and others (Zhang et al., 2023) have showcased robust performance in both few-shot and fully trained scenarios. However, when evaluating these LMs with zero or few shots, it is challenging to determine whether they are learning the task (Webson & Pavlick, 2021) or merely hallucinating on simpler classification tasks, which complicates model evaluation (Ji et al., 2023; Lee & Lindsey, 2024). Nevertheless, fine-tuning these language models enables them to be adapted to specific tasks using a few-shot (minimal data) approach (Harari & Katz, 2022; Liu et al., 2022; Perez et al., 2021; Zhao et al., 2021).

Despite recent advances in language models and tabular machine learning, numerous unanswered questions remain regarding the use of language models in this field. This study therefore aims to comprehensively address some of these knowledge gaps concerning the systematic approach to the machine learning pipeline and how these new approaches align with conventional paradigms. Additionally, we highlight other common scenarios where pre-trained language models can be beneficial and ask whether these general models should be adopted over previous state-of-the-art models, which are primarily based on gradient boosting. We hypothesize that language models do not adhere to conventional paradigms and do not require data curation techniques, but we believe that these pre-trained models can be effective tabular solvers.

3 Methodology

3.1 Text Serialization

Problem Formulation: Text serialization is the process of transforming structured tabular data $X$ with dimensions $n \times m$ into textual representations, where $n$ is the number of samples and $m$ is the number of features. In their study on TabLLM, Hegselmann et al. identified that text templates and list readouts provided the best results among various serialization strategies (Hegselmann et al., 2023); we therefore adopt a text template approach in our analysis. Mathematically, the transformation can be represented as follows. Let $X = \{x_{ij}\}_{n \times m}$ be the input dataset, where $x_{ij}$ is the value of the $i$-th sample in the $j$-th feature, and let $Y = \{y_i\}_{n}$ be the corresponding set of labels for each sample in $X$. The goal of text serialization is to define a mapping $\Phi: X \rightarrow T$, where $T = \{t_i\}_{n}$ represents the serialized text derived from the data in $X$. The mapping $\Phi$ uses template filling to convert each sample $x_i$ into its corresponding serialized text $t_i$. We then use this textual representation alongside the labels for supervised fine-tuning on classification tasks.
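To make the mapping concrete, the following is a minimal sketch of $\Phi$ as template filling in Python. The template wording and field names here are illustrative rather than the paper's exact templates (the actual templates appear in Appendix Section B).

```python
from typing import Dict, List

def serialize_row(row: Dict[str, object], template: str) -> str:
    """Apply the mapping Phi to one sample x_i, producing its text t_i."""
    return template.format(**row)

def serialize_table(rows: List[Dict[str, object]], template: str) -> List[str]:
    """Serialize every sample in X, yielding the corpus T = {t_i}."""
    return [serialize_row(r, template) for r in rows]

# Illustrative Iris-style template; field names are assumptions.
template = (
    "The Iris has a sepal length of {sepal_length} centimeters. "
    "Sepal width is {sepal_width} centimeters. "
    "Petal length is {petal_length} centimeters. "
    "Petal width is {petal_width} centimeters."
)
rows = [{"sepal_length": 5.1, "sepal_width": 3.5,
         "petal_length": 1.4, "petal_width": 0.2}]
print(serialize_table(rows, template)[0])
```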

3.2 Language Model Selection

In our study, we must select a language model backbone. There are numerous backbones to choose from, but we filter these by selecting language models that were pretrained with text classification objectives. To select the best Language Model (LM) for our benchmark, we evaluate multiple open-source LMs sourced from the Hugging Face sequence classification library (Wolf et al., 2019). We additionally benchmark several models from the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2022), a comprehensive framework designed to evaluate the performance of text embedding models across a wide range of tasks, and select models based on their rank in text classification. This is an effort to find the LM that provides the best representation for our serialized textual data. In the table in Appendix Section C, we highlight the LMs we evaluate with a short description of each.
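As a sketch of how a candidate backbone is loaded and queried through the Hugging Face sequence classification interface, consider the snippet below. The checkpoint name is one example candidate, the sentence is an illustrative serialization, and the classification head is untrained until the fine-tuning described in Section 4.4.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# One example candidate backbone; the full list appears in Appendix Section C.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)  # e.g., a binary task such as Titanic survival

# Score one serialized sample (sentence wording is illustrative).
text = "The passenger is a 29 year old female traveling in first class."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits  # raw scores from the not-yet-tuned head
print(logits)
```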

3.3 Current Understanding and Limitations

What do we know about Text Serialization?

From the literature, several concrete findings about text serialization have been established. Text serialization has enabled the integration of tabular data with language models (LMs), leading to competitive performance on datasets with minimal samples (few-shot) (Hegselmann et al., 2023; Yang et al., 2024) or no samples at all (zero-shot) (Wei et al., 2021; Kojima et al., 2022; Zhong et al., 2021). This success comes from converting data into a natural language format, which allows the hundreds of millions of pre-trained parameters within an LM to be applied through transfer learning to carry out inference. While recent works have progressed toward the ability to read structured data directly (Song et al., 2023; Chen et al., 2024b; Yao et al., 2023), text serialization appears to remain the best method for integrating tabular data with LMs. Another use case of text serialization arises when tabular data has categorical fields with a high number of classes or heterogeneous data (numerical, categorical, free text) within the tabular fields (Lee et al., 2024b). This methodology seamlessly preserves all the data in its natural form (no feature engineering necessary), represented entirely as text. Groups including (Belyaeva et al., 2023; Chen et al., 2024a) also demonstrated that text serialization was particularly effective when integrated with paired multimodal datasets, enabling contrastive methods to learn shared latent representations (Radford et al., 2021).

What needs to be addressed

While considerable progress has been made in advancing tabular data with LM technologies, many intermediate steps at both the data and classification levels remain unexamined. This paper aims to address some of the key gaps in the current literature, providing a more comprehensive understanding of the existing challenges and solutions.

Data Questions:

Many questions remain regarding whether text serialization or LMs adhere to approaches similar to those of traditional machine learning paradigms. This is particularly relevant in the data curation process: handling raw data that contains missing values, identifying important and unimportant features, and dealing with differently distributed numerical data. Data curation is often a crucial component of traditional machine learning pipelines, but no study has yet examined whether similar steps are required with LM technologies for supervised tasks. A visualization of this exploration can be seen in Figure 1.

Classification Questions:

In addition, there have been no studies regarding whether pre-trained LMs should be used for all tabular supervised classification tasks. Therefore, we explore several datasets with commonly encountered characteristics and benchmark them against various tabular SOTA models and traditional machine learning methods. We aim to determine whether LMs support or contradict previous claims that gradient boosting performs better than deep learning-based models on tabular tasks (Grinsztajn et al., 2022).

4 Experimental Setup

4.1 Data

In our study, we utilize eight datasets, which we divide into two groups.

Table 1: Dataset characteristics.

| Dataset   | Sample Size (n) | # of Features (m) | Binary? |
|-----------|-----------------|-------------------|---------|
| Iris      | 150             | 4                 | ✗       |
| Diabetes  | 784             | 8                 | ✓       |
| Titanic ♥ | 891             | 11                | ✓       |
| Wine      | 178             | 13                | ✗       |
| HELOC     | 10,459          | 23                | ✓       |
| Fraud     | 284,807         | 30                | ✓       |
| Crime     | 878,049         | 8                 | ✗       |
| Cancer    | 801             | 20,533            | ✗       |

Baseline Datasets:

The first four datasets are commonly used baselines in tabular machine learning: the Iris, Wine, Diabetes, and Titanic datasets, which are either binary or multiclass (3-class) classification problems sourced from the UCI data repository or previous literature (Asuncion & Newman, 2007; Smith et al., 1988). We use these baseline datasets in our data-level experiments to identify which preprocessing steps affect relative performance and should be adopted for our SOTA experiments.
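As a sketch, the baseline datasets can be loaded from standard public entry points; the CSV paths for Titanic and Diabetes below are placeholder assumptions standing in for the Kaggle/UCI copies.

```python
import pandas as pd
from sklearn.datasets import load_iris, load_wine

iris = load_iris(as_frame=True).frame   # 150 samples, 4 features, 3 classes
wine = load_wine(as_frame=True).frame   # 178 samples, 13 features, 3 classes

# Assumed local copies of the public CSVs (paths are placeholders).
titanic = pd.read_csv("titanic.csv")          # binary survival labels
diabetes = pd.read_csv("pima_diabetes.csv")   # binary onset labels (Smith et al., 1988)
```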

Experimental Datasets:

The second group can be labeled as datasets with interesting and commonly encountered machine learning characteristics. We utilize these datasets only in our SOTA evaluation, using the preprocessing steps identified in our previous experiments. These datasets include the Identifying Targets for Cancer Using Gene Expression Profiles dataset, which exhibits high dimensionality (Fiorini, 2016); the HELOC dataset (Brown et al., 2018), which contains well-documented distribution shift identified by (Gardner et al., 2023); the San Francisco Crime dataset, which contains inherent biases towards certain neighborhoods (Asuncion & Newman, 2007); and the Credit Card Fraud dataset, which exhibits class imbalance (Dal Pozzolo et al., 2015) (0.172% of the data is fraud). These datasets contain a mixture of binary and multi-class classification tasks. All characteristics of the datasets, including sample size and feature count, can be found in Table 1. Additional details about what the raw data looks like and how we serialized it in different ways can be found in Appendix Section B.

4.2 Experiments

At the data level, we have identified gaps in the literature related to data curation for text serialization and whether it follows approaches similar to traditional machine learning paradigms. To explore the effects of various preprocessing measures on serialized tabular data, we use a baseline model where no data curation is performed. We then explore how applying different preprocessing techniques affects performance relative to this baseline.

Additionally, at the classification level, we are interested in testing the robustness of LMs when faced with commonly encountered real-life dataset characteristics. Using the insights from our data curation experiments, we evaluate the LMs against existing methods and commonly used ML methods on datasets that exhibit class imbalance and distribution shift, among other characteristics. By introducing these challenges into our benchmark datasets, we aim to evaluate the relative performance of LMs in tackling fundamentally difficult challenges in tabular machine learning. We describe our experiments in greater detail in the following subsections.

Data Experiments

Feature Selection

Feature selection is the process of identifying and selecting a subset of relevant features from the original set to improve model performance and efficiency (Guyon & Elisseeff, 2003). In our first experiment, we compare a baseline model, where no feature selection is applied, to a model where feature selection is utilized. We employ two feature selection methods: one using SHapley Additive exPlanations (SHAP) values extracted from an XGBoost model and another using the ANOVA F-test (St et al., 1989). Further details on how features are derived from these methods can be found in Appendix Section E. We then assess whether feature selection yields better, worse, or nuanced results. We further include serialized text in Appendix Section B to give readers a view of what these sentences look like with and without feature selection.
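A minimal sketch of the two selection routes follows, assuming a pandas DataFrame `X` and label vector `y`; the exact settings used in the paper are in Appendix Section E.

```python
import numpy as np
import shap
import xgboost
from sklearn.feature_selection import SelectKBest, f_classif

def shap_top_features(X, y, k=5):
    """Rank features by mean |SHAP| from a fitted XGBoost model
    (binary task assumed for this sketch)."""
    model = xgboost.XGBClassifier().fit(X, y)
    shap_values = shap.TreeExplainer(model).shap_values(X)
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1]
    return [X.columns[i] for i in order[:k]]

def anova_top_features(X, y, k=5):
    """Keep the k features with the highest ANOVA F-statistic."""
    selector = SelectKBest(f_classif, k=k).fit(X, y)
    return list(X.columns[selector.get_support()])
```

Only the selected columns are then serialized, yielding the shorter sentences compared in Table 2.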

Feature Scaling & Outlier Handling

Feature scaling involves converting features within a dataset to ensure they are on a similar scale, preventing certain features from dominating others in the analysis. We standardize features (subtracting the mean μ and dividing by the standard deviation σ) when they are on different scales and the machine learning algorithm is scale-sensitive. We normalize features (rescaling to the range [0, 1]) to bring all features to a common range, particularly in the presence of outliers. Additionally, we apply a log transformation when the data is skewed or contains outliers, as it can mitigate the impact of extreme values and make the distribution more normal. These measures are applied to the Titanic dataset (Eaton & Haas, 1995) based on its characteristics, and we report whether such steps are necessary.
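A sketch of the three corrections on the skewed Titanic columns is shown below; the column names follow the common schema for this dataset and are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"age": [22.0, 38.0, 26.0, 80.0],
                   "fare": [7.25, 71.28, 7.92, 512.33]})  # right-skewed fare

standardized = StandardScaler().fit_transform(df)  # (x - mu) / sigma
normalized = MinMaxScaler().fit_transform(df)      # rescale to [0, 1]
log_transformed = np.log1p(df)                     # compress extreme values

# Each scaled variant is then written into the text template in place of
# the raw value before fine-tuning.
```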

Missing Data Handling & Imputation

Missing data handling and imputation involve techniques for addressing and filling in missing values within a dataset to ensure completeness and maintain the integrity of the analysis. Unlike in traditional tabular machine learning, there is no established method for handling missing values in serialized text. Therefore, we explore the effects of ignoring missing values (equivalent to dropping that single cell) and of adding filler sentences as a form of imputation, similar to those described in (Lee et al., 2024b). We then perform a sensitivity analysis observing how much the logarithm of odds (logits) for each class changes under these imputation strategies.
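The three treatments can be sketched as alternative serialization branches for a missing cell; the filler wording below is illustrative, not the paper's exact sentences.

```python
from typing import Optional

def serialize_age(age: Optional[float], strategy: str) -> str:
    """Serialize one (possibly missing) Titanic age cell."""
    if age is not None:
        return f"The passenger is {age:.0f} years old."
    if strategy == "ignore":
        return ""  # baseline: drop the clause, as if the cell were removed
    if strategy == "impute_1":
        return "No value was recorded for this field."  # unrelated filler
    if strategy == "impute_2":
        return "The passenger's age is unknown."        # objective-related filler
    raise ValueError(f"unknown strategy: {strategy}")

for s in ("ignore", "impute_1", "impute_2"):
    print(repr(serialize_age(None, s)))
```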

Classification Experiments

SOTA Benchmarks on Various Tabular Datasets

In our classification experiments, we are particularly interested in how LMs perform compared to traditional machine learning models and several models from the literature. We test for SOTA on all the baseline datasets as well as our experimental datasets referenced in Section 4.1. These include datasets with high dimensionality, distribution shift, bias, and class imbalance. We do not apply data-level corrections (e.g., SMOTE) and instead assess performance with these characteristics intact.

4.3 Benchmarking Baseline Models

To evaluate the relative performance of text serialization and SFT, we identify models commonly used for tabular machine learning that have excelled at tabular tasks. The models included in the evaluation, sourced from (Dinh et al., 2022; Hegselmann et al., 2023), are Support Vector Machines (SVM) with the radial basis function (RBF) kernel (Cortes & Vapnik, 1995), Light Gradient Boosted Machines (LGBM) (Ke et al., 2017), and XGBoost (Chen & Guestrin, 2016). From the literature, we also use TabNet (Arik & Pfister, 2021) and TabPFN (Hollmann et al., 2022), which were optimized for tabular tasks. The metrics we use to evaluate models are F1, accuracy, Area Under the Receiver Operating Characteristic curve (AUROC), and the Matthews correlation coefficient (MCC) (Chicco & Jurman, 2020). When classification objectives are not binary, we use macro averaging to create a uniform view of performance metrics across all methods.
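A sketch of the metric computation, with macro averaging switched on for non-binary objectives:

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             roc_auc_score)

def evaluate(y_true, y_pred, y_prob, multiclass=False):
    """y_prob holds positive-class probabilities (binary) or an
    (n_samples, n_classes) matrix (multi-class)."""
    if multiclass:
        f1 = f1_score(y_true, y_pred, average="macro")
        auroc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
    else:
        f1 = f1_score(y_true, y_pred)
        auroc = roc_auc_score(y_true, y_prob)
    return {"accuracy": accuracy_score(y_true, y_pred), "f1": f1,
            "auroc": auroc, "mcc": matthews_corrcoef(y_true, y_pred)}
```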

4.4 Training and Model Optimization

To optimize language model performance, we use a standard learning rate of 2e-4 with a learning rate scheduler to tune this parameter dynamically. We also apply a dropout of 0.3 to prevent overfitting during fine-tuning, and use a batch size of 64 on each dataset. We minimize binary cross-entropy loss for binary classification and cross-entropy loss for multi-class classification. We implement our evaluations in PyTorch using LMs sourced from Hugging Face, and run all evaluations on a single Tesla V100 GPU with 16 GB of VRAM.
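The configuration above can be sketched as follows for a binary task. The choices of AdamW and ReduceLROnPlateau are assumptions, since the paper specifies only the learning rate, scheduler use, dropout, batch size, and losses.

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",  # the selected TabLM backbone (Section 5.1)
    num_labels=1,               # single logit for binary classification
    dropout=0.3,                # DistilBERT's config exposes this knob
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
scheduler = ReduceLROnPlateau(optimizer)        # dynamic lr adjustment (assumed)
loss_fn = torch.nn.BCEWithLogitsLoss()          # CrossEntropyLoss for multi-class

def training_step(batch):
    """One SFT step over a batch of 64 tokenized serialized sentences."""
    optimizer.zero_grad()
    logits = model(input_ids=batch["input_ids"],
                   attention_mask=batch["attention_mask"]).logits
    loss = loss_fn(logits.squeeze(-1), batch["labels"].float())
    loss.backward()
    optimizer.step()
    return loss.item()
```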

For the standard machine learning models, we conduct a five-fold cross-validation grid search to find optimal hyperparameters for the benchmark. We list the hyperparameters we searched over in the appendix for reproducibility (Appendix Section D).
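A sketch of the search for one baseline; the XGBoost grid shown is illustrative, with the full grids listed in Appendix Section D.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {                      # illustrative grid, not the Appendix D grid
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 6, 9],
    "learning_rate": [0.01, 0.1, 0.3],
}
search = GridSearchCV(XGBClassifier(), param_grid, cv=5, scoring="f1_macro")
# search.fit(X_train, y_train); the benchmark then uses search.best_estimator_.
```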

5 Results

5.1 Language Model Evaluation

[Figure 2]

We begin our analysis by identifying our “TabLM” through a benchmarking study on a set of language models sourced from the Hugging Face sequence classification library and the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2022). We conducted this analysis on the Titanic baseline dataset using serialized text template inputs. From Figure 2, we find that DistilBERT is the best-performing model, and we select it as our Tabular Language Model, which we will refer to as TabLM. One notable finding from this evaluation is that the MTEB ranking for text classification does not carry over to tabular machine learning tasks, as evidenced by standard models outperforming the General Text Embedding (GTE) model (Li et al., 2023). Additionally, the varying fluctuations across performance metrics illustrate how the different pre-training objectives of these foundation models may optimize different performance metrics. Further details are located in Appendix Section C.1.

5.2 Data Curation Results

Feature Selection

In our feature selection experiment, we compare the performance between a baseline language model (LM) without feature selection and an LM that uses shorter serialized sentences containing only important features. These features are identified through XGBoost feature importance and visualized using SHapley Additive exPlanations (SHAP) values and ANOVA F-tests.

Table 2: Effect of feature selection on prediction performance.

| Dataset   | AUROC (without FS) | F1 (without FS) | AUROC (with FS) | F1 (with FS) | Improved? |
|-----------|--------------------|-----------------|-----------------|--------------|-----------|
| Iris      | 1.000              | 1.000           | 1.000           | 1.000        | –         |
| Wine      | 0.952              | 0.944           | 0.976           | 0.972        | ✓         |
| Diabetes  | 0.654              | 0.621           | 0.659           | 0.659        | ✓         |
| Titanic ♥ | 0.786              | 0.871           | 0.777           | 0.852        | ✗         |

This study reveals that feature selection has a positive effect on both F1 score and AUROC in most evaluation datasets, as seen in Table 2. While the results are somewhat nuanced, we observe that selecting appropriate features for serialization tends to enhance performance in classification tasks, and this effect is likely to be stronger in datasets with higher dimensionality.

Feature Scaling & Outlier Handling

In our experiment on feature scaling and outlier handling, we benchmark models that serialize their numerical data using various feature scaling methods to compare their performance across multiple metrics. This evaluation specifically focuses on the Titanic dataset, which exhibits right-skewed distributions in both the fare and age features. To address these issues, we employ standardization, normalization, and logarithmic transformations on these features, applying corrections that offer different benefits as detailed in Section 4. Each method is analyzed for its effectiveness in mitigating the impact of skewness and improving model performance, providing a comprehensive understanding of how feature scaling can influence key performance indicators.

[Figure 3]

Our analysis reveals nuanced results, where various feature scaling methods yield marginal gains and deficits. Based on these results, we advise that scaling methods should be applied in accordance with the classification objectives. However, it is also likely possible to achieve acceptable results without performing feature scaling.


Table 3: State of the Art Evaluation – Baseline Datasets.

| Dataset   | Method    | Accuracy | F1     | AUROC  | MCC    | Current State of the Art             | TabLM SOTA? |
|-----------|-----------|----------|--------|--------|--------|--------------------------------------|-------------|
| Iris      | SVM (RBF) | 1.0000   | 1.0000 | 1.0000 | 1.1870 | 1.0000 (Acc) (Ojha & Nicosia, 2020)  | ✓           |
|           | LGBM      | 1.0000   | 1.0000 | 1.0000 | 1.1870 |                                      |             |
|           | XGBoost   | 1.0000   | 1.0000 | 1.0000 | 1.1870 |                                      |             |
|           | TabNet    | 1.0000   | 1.0000 | 1.0000 | 1.1870 |                                      |             |
|           | TabPFN    | 1.0000   | 1.0000 | –      | 1.1870 |                                      |             |
|           | TabLM     | 1.0000   | 1.0000 | 1.0000 | 1.1870 |                                      |             |
| Wine      | SVM (RBF) | 0.8333   | 0.8107 | 0.9414 | 1.2004 | 0.9800 (Acc) (Di et al., 2020)       | ✗           |
|           | LGBM      | 1.0000   | 1.0000 | 1.0000 | 1.2089 |                                      |             |
|           | XGBoost   | 0.9722   | 0.9663 | 1.0000 | 1.2133 |                                      |             |
|           | TabNet    | 0.8333   | 0.8497 | 0.9503 | 0.7306 |                                      |             |
|           | TabPFN    | 0.9800   | 0.9785 | –      | 0.9704 |                                      |             |
|           | TabLM     | 0.9722   | 0.9761 | 1.0000 | 1.2147 |                                      |             |
| Diabetes  | SVM (RBF) | 0.7662   | 0.7411 | 0.8044 | 0.4833 | 0.7879 (Acc) (Sarkar, 2022)          | ✗           |
|           | LGBM      | 0.7532   | 0.7334 | 0.8129 | 0.4671 |                                      |             |
|           | XGBoost   | 0.7597   | 0.7301 | 0.8235 | 0.4640 |                                      |             |
|           | TabNet    | 0.7273   | 0.6250 | 0.8525 | 0.4329 |                                      |             |
|           | TabPFN    | 0.7662   | 0.7433 | 0.8211 | 0.4870 |                                      |             |
|           | TabLM     | 0.6423   | 0.6594 | 0.6593 | 0.3962 |                                      |             |
| Titanic ♥ | SVM (RBF) | 0.7765   | 0.7687 | 0.8654 | 0.5376 | 0.7985 (Acc) (Sarkar, 2022)          | ✓           |
|           | LGBM      | 0.7877   | 0.7747 | 0.8995 | 0.5572 |                                      |             |
|           | XGBoost   | 0.7989   | 0.7889 | 0.8958 | 0.5812 |                                      |             |
|           | TabNet    | 0.8212   | 0.7612 | 0.8938 | 0.6192 |                                      |             |
|           | TabPFN    | 0.8101   | 0.7344 | 0.4747 | 0.5923 |                                      |             |
|           | TabLM     | 0.8212   | 0.7777 | 0.8521 | 0.6001 |                                      |             |

Handling Missing Data & Imputation

Lastly, in our experiments on missing data handling, we evaluated a baseline model that ignores missing values by not serializing any text for them (equivalent to dropping that cell in tabular data). We then tested two strategies for imputing filler sentences into serialized data. The first strategy (Model: Impute 1) used a sentence with no relevance to the classification objective, while the second (Model: Impute 2) used a filler sentence related to the classification objective. We analyzed the differences, denoted Δ, in the logarithm of odds (logits) by subtracting the logits of the two imputed models from the baseline logits to assess how the logits for each class were affected by imputation. Logits centered at the origin (0, 0) were essentially unaltered, whereas logits deviating from the origin were heavily altered.
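The Δ statistic itself is a simple per-class difference of log-odds; a sketch follows, under the assumption that both models score the same held-out serialized samples.

```python
import numpy as np

def logit_delta(baseline_logits: np.ndarray, imputed_logits: np.ndarray) -> np.ndarray:
    """Per-sample, per-class shift in log-odds; points near the origin in
    Figure 4 correspond to predictions the imputation left intact."""
    return baseline_logits - imputed_logits

# Toy example: one sample, two classes.
delta = logit_delta(np.array([[2.1, -1.3]]), np.array([[1.9, -1.1]]))
print(delta)  # small shifts -> prediction essentially unchanged
```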

[Figure 4]

Our results, displayed in Figure 4, reveal that imputing filler sentences similar to those in (Lee et al., 2024b) should be done with caution, as it appears to cause significant changes in Δ, potentially altering the final class prediction. The model may learn the distribution of the imputed data, which can greatly affect performance, particularly when a specific feature has a substantial amount of missing data.


Table 4: State of the Art Evaluation – Experimental Datasets. Entries marked * indicate that TabPFN could not be reliably evaluated (see Section 5.3).

| Dataset  | Method    | Accuracy | F1      | AUROC   | MCC     | Current State of the Art            | TabLM SOTA? |
|----------|-----------|----------|---------|---------|---------|-------------------------------------|-------------|
| HELOC †  | SVM (RBF) | 0.7223   | 0.7207  | 0.7903  | 0.4426  | N/A                                 | –           |
|          | LGBM      | 0.7280   | 0.7267  | 0.7958  | 0.4541  |                                     |             |
|          | XGBoost   | 0.7170   | 0.7157  | 0.7746  | 0.4321  |                                     |             |
|          | TabNet    | 0.7275   | 0.7070  | 0.7966  | 0.4532  |                                     |             |
|          | TabPFN    | 0.7500*  | 0.7253* | 0.4519* | 0.5014* |                                     |             |
|          | TabLM     | 0.7157   | 0.7025  | 0.7939  | 0.4331  |                                     |             |
| Fraud ♦  | SVM (RBF) | 0.9983   | 0.4996  | 0.4790  | 0.0000  | 0.9530 (AUROC) (Xu et al., 2023)    | ✗           |
|          | LGBM      | 0.9994   | 0.9075  | 0.9083  | 0.8167  |                                     |             |
|          | XGBoost   | 0.9996   | 0.9293  | 0.9811  | 0.8635  |                                     |             |
|          | TabNet    | 0.9994   | 0.8218  | 0.9640  | 0.8215  |                                     |             |
|          | TabPFN    | *        | *       | *       | *       |                                     |             |
|          | TabLM     | 0.9988   | 0.9211  | 0.9155  | 0.8545  |                                     |             |
| Crime ♣  | SVM (RBF) | 0.2006   | 0.0088  | 0.4849  | 0.2310  | N/A                                 | –           |
|          | LGBM      | 0.2636   | 0.0764  | 0.6291  | 0.2395  |                                     |             |
|          | XGBoost   | 0.2606   | 0.0756  | 0.6467  | 0.2389  |                                     |             |
|          | TabNet    | 0.3087   | 0.0502  | 0.7193  | 0.2097  |                                     |             |
|          | TabPFN    | *        | *       | *       | *       |                                     |             |
|          | TabLM     | 0.3212   | 0.0671  | 0.6789  | 0.2437  |                                     |             |
| Cancer ♠ | SVM (RBF) | 1.0000   | 1.0000  | 1.0000  | 1.1428  | N/A                                 | –           |
|          | LGBM      | 1.0000   | 1.0000  | 1.0000  | 1.1428  |                                     |             |
|          | XGBoost   | 1.0000   | 1.0000  | 1.0000  | 1.1428  |                                     |             |
|          | TabNet    | 0.9814   | 0.9735  | 0.9994  | 0.9749  |                                     |             |
|          | TabPFN    | *        | *       | *       | *       |                                     |             |
|          | TabLM     | 0.9833   | 0.9826  | 0.9864  | 0.9792  |                                     |             |

5.3 SOTA Benchmark

Having identified the preprocessing steps that are generally beneficial to language models and text serialization, we now proceed with a comprehensive benchmark across all baseline and experimental datasets. This benchmark compares our TabLM against traditional ML algorithms and two specific algorithms from recent literature: TabNet (Arik & Pfister, 2021) and TabPFN (Hollmann et al., 2022). Note that TabPFN is not suitable for datasets with training sizes above 1,024 or feature counts above 10: predictions become slower and less reliable as dataset size increases, and its authors advise against using it on datasets with over 10k samples due to potential machine crashes from quadratic memory scaling. Consequently, we do not include TabPFN evaluations on the Crime, Cancer, and Fraud classification datasets. We also include current state-of-the-art models identified from competitions and the open web to show the highest reported metric; a separate column in our benchmarks highlights these methods and their winning performance metric.

6 Discussion

6.1 Language Models benefit from Feature Selection

From our study on data curation, we identified that among the three techniques, feature selection was the only beneficial data curation strategy. Other strategies, such as feature scaling and handling missing data, showed negative or nuanced results, suggesting that their inclusion could lead to adverse outcomes. Therefore, based on our findings, we advise researchers who use language models on tabular tasks to apply these data curation techniques with caution. We therefore believe more work has to be done in identifying appropriate serialization strategies.

6.2 Serialization Sensitivity

Previous studies (Hegselmann et al., 2023) and our experiments with imputation indicate that the logarithm of odds is highly sensitive to minor modifications of the serialized text. Hegselmann et al. found that list readouts and text templates were the most effective serialization strategies. However, our analysis suggests that engineering the input text can significantly enhance or reduce the performance of various language models in classification tasks.

6.3 When Do I Use LMs for Tabular Tasks?

From this evaluation, it is not conclusively evident that traditional ML techniques or neural network models designed for tabular tasks should be replaced by emerging language model (LM) techniques. These language models were not optimized for tabular tasks, and it appears challenging to fine-tune these models without large datasets. This is evident in our baseline experiments where all the datasets had sample sizes less than 1000. This situation is analogous to other deep learning methodologies that require substantial data to tune the large number of parameters and are at risk of overfitting to the training set. Regarding the experimental datasets with larger sample sizes, it also appears that pre-training and transfer learning offer little benefit to these tasks and do not enhance predictive performance.

Therefore, while our TabLM model reached SOTA accuracy levels for specific tasks, other methodologies often yielded more robust results across the board. This finding suggests that these models may not be universally suitable for tabular tasks. However, these models were still competitive, despite not always achieving SOTA performance levels. Extensive research is ongoing to optimize LMs, and more recently LLMs, for performing tasks on structured data. However, we believe that pre-trained language models should not replace conventional models, and we support the notion that traditional models are still better suited for tabular tasks than deep learning methods (Grinsztajn et al., 2022).

7 Conclusion

In this study, we conducted a series of experiments related to text serialization and compared them to traditional machine learning paradigms. We assessed how various preprocessing steps can enhance or diminish model performance. We also performed a benchmarking evaluation against traditional ML models and two tabular deep learning models and found that pre-trained language models are not better than these existing methods. We therefore conclude that pre-trained models are not better than gradient-boosted methods.

Code and Data

All code can be found on GitHub. All data are described in Appendix Section B.

7.1 Impact Statement

This work aims to advance Tabular Machine Learning by comparing modern NLP language models (LMs) with traditional paradigms. While not covering all aspects of text serialization and tabular characteristics, the study reveals a generally analogous behavior across the evaluated models.

References

  • Aeberhard & Forina (1991)Aeberhard, S. and Forina, M.Wine.UCI Machine Learning Repository, 1991.DOI: https://doi.org/10.24432/C5PC7J.
  • Arik & Pfister (2021)Arik, S.Ö. and Pfister, T.Tabnet: Attentive interpretable tabular learning.In Proceedings of the AAAI conference on artificial intelligence, volume35, pp. 6679–6687, 2021.
  • Asuncion & Newman (2007)Asuncion, A. and Newman, D.Uci machine learning repository, 2007.
  • Bahdanau etal. (2014)Bahdanau, D., Cho, K., and Bengio, Y.Neural machine translation by jointly learning to align and translate.arXiv preprint arXiv:1409.0473, 2014.
  • Beltagy etal. (2020)Beltagy, I., Peters, M.E., and Cohan, A.Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150, 2020.
  • Belyaeva etal. (2023)Belyaeva, A., Cosentino, J., Hormozdiari, F., Eswaran, K., Shetty, S., Corrado, G., Carroll, A., McLean, C.Y., and Furlotte, N.A.Multimodal llms for health grounded in individual-specific data.In Workshop on Machine Learning for Multimodal Healthcare Data, pp. 86–102. Springer, 2023.
  • Borisov etal. (2022)Borisov, V., Leemann, T., Seßler, K., Haug, J., Pawelczyk, M., and Kasneci, G.Deep neural networks and tabular data: A survey.IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • Brown etal. (2018)Brown, K., Doran, D., Kramer, R., and Reynolds, B.Heloc applicant risk performance evaluation by topological hierarchical decomposition.arXiv preprint arXiv:1811.10658, 2018.
  • Brown etal. (2020)Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., etal.Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020.
  • Chen etal. (2024a)Chen, E., Kansal, A., Chen, J., Jin, B.T., Reisler, J., Kim, D.E., and Rajpurkar, P.Multimodal clinical benchmark for emergency care (mc-bec): A comprehensive benchmark for evaluating foundation models in emergency medicine.Advances in Neural Information Processing Systems, 36, 2024a.
  • Chen & Guestrin (2016)Chen, T. and Guestrin, C.Xgboost: A scalable tree boosting system.In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp. 785–794, 2016.
  • Chen etal. (2024b)Chen, W., Yuan, C., Yuan, J., Su, Y., Qian, C., Yang, C., Xie, R., Liu, Z., and Sun, M.Beyond natural language: Llms leveraging alternative formats for enhanced reasoning and communication.arXiv preprint arXiv:2402.18439, 2024b.
  • Chicco & Jurman (2020)Chicco, D. and Jurman, G.The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation.BMC genomics, 21:1–13, 2020.
  • Chilimbi etal. (2014)Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K.Project adam: Building an efficient and scalable deep learning training system.In 11th USENIX symposium on operating systems design and implementation (OSDI 14), pp. 571–582, 2014.
  • Clark etal. (2020)Clark, K., Luong, M.-T., Le, Q.V., and Manning, C.D.Electra: Pre-training text encoders as discriminators rather than generators.arXiv preprint arXiv:2003.10555, 2020.
  • Cortes & Vapnik (1995)Cortes, C. and Vapnik, V.Support-vector networks.Machine learning, 20:273–297, 1995.
  • DalPozzolo etal. (2014)DalPozzolo, A., Caelen, O., LeBorgne, Y.-A., Waterschoot, S., and Bontempi, G.Learned lessons in credit card fraud detection from a practitioner perspective.Expert systems with applications, 41(10):4915–4928, 2014.
  • DalPozzolo etal. (2015)DalPozzolo, A., Caelen, O., Johnson, R.A., and Bontempi, G.Calibrating probability with undersampling for unbalanced classification.In 2015 IEEE symposium series on computational intelligence, pp. 159–166. IEEE, 2015.
  • DalPozzolo etal. (2017)DalPozzolo, A., Boracchi, G., Caelen, O., Alippi, C., and Bontempi, G.Credit card fraud detection: a realistic modeling and a novel learning strategy.IEEE transactions on neural networks and learning systems, 29(8):3784–3797, 2017.
  • Devlin etal. (2018)Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.Bert: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805, 2018.
  • Di etal. (2020)Di, X., Yu, P., Bu, R., and Sun, M.Mutual information maximization in graph neural networks.In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. IEEE, 2020.
  • Dinh etal. (2022)Dinh, T., Zeng, Y., Zhang, R., Lin, Z., Gira, M., Rajput, S., Sohn, J.-y., Papailiopoulos, D., and Lee, K.Lift: Language-interfaced fine-tuning for non-language machine learning tasks.Advances in Neural Information Processing Systems, 35:11763–11784, 2022.
  • Eaton & Haas (1995)Eaton, J.P. and Haas, C.Titanic: Triumph and tragedy.WW Norton & Company, 1995.
  • Fang etal. (2024)Fang, X., Xu, W., Tan, F.A., Zhang, J., Hu, Z., Qi, Y., Nickleach, S., Socolinsky, D., Sengamedu, S., and Faloutsos, C.Large language models(llms) on tabular data: Prediction, generation, and understanding – a survey, 2024.
  • Feng etal. (2019)Feng, S., Chen, Q., Gu, G., Tao, T., Zhang, L., Hu, Y., Yin, W., and Zuo, C.Fringe pattern analysis using deep learning.Advanced photonics, 1(2):025001–025001, 2019.
  • Fiorini (2016)Fiorini, S.gene expression cancer RNA-Seq.UCI Machine Learning Repository, 2016.DOI: https://doi.org/10.24432/C5R88H.
  • Fisher (1988)Fisher, R.A.Iris.UCI Machine Learning Repository, 1988.DOI: https://doi.org/10.24432/C56C76.
  • Gardner etal. (2023)Gardner, J., Popovic, Z., and Schmidt, L.Benchmarking distribution shift in tabular data with tableshift.Advances in Neural Information Processing Systems, 2023.
  • (29)GIDROL, J.-B.Text classification with llms.
  • Golkar etal. (2023)Golkar, S., Pettee, M., Eickenberg, M., Bietti, A., Cranmer, M., Krawezik, G., Lanusse, F., McCabe, M., Ohana, R., Parker, L., etal.xval: A continuous number encoding for large language models.arXiv preprint arXiv:2310.02989, 2023.
  • Gorishniy etal. (2021)Gorishniy, Y., Rubachev, I., Khrulkov, V., and Babenko, A.Revisiting deep learning models for tabular data.Advances in Neural Information Processing Systems, 34:18932–18943, 2021.
  • Gorishniy etal. (2022)Gorishniy, Y., Rubachev, I., and Babenko, A.On embeddings for numerical features in tabular deep learning.Advances in Neural Information Processing Systems, 35:24991–25004, 2022.
  • Grinsztajn etal. (2022)Grinsztajn, L., Oyallon, E., and Varoquaux, G.Why do tree-based models still outperform deep learning on typical tabular data?Advances in neural information processing systems, 35:507–520, 2022.
  • Guyon & Elisseeff (2003)Guyon, I. and Elisseeff, A.An introduction to variable and feature selection.Journal of machine learning research, 3(Mar):1157–1182, 2003.
  • Harari & Katz (2022)Harari, A. and Katz, G.Few-shot tabular data enrichment using fine-tuned transformer architectures.In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1577–1591, 2022.
  • He etal. (2020)He, P., Liu, X., Gao, J., and Chen, W.Deberta: Decoding-enhanced bert with disentangled attention.arXiv preprint arXiv:2006.03654, 2020.
  • Hegselmann etal. (2023)Hegselmann, S., Buendia, A., Lang, H., Agrawal, M., Jiang, X., and Sontag, D.Tabllm: Few-shot classification of tabular data with large language models.In International Conference on Artificial Intelligence and Statistics, pp. 5549–5581. PMLR, 2023.
  • Hegselmann etal. (2024)Hegselmann, S., Shen, S.Z., Gierse, F., Agrawal, M., Sontag, D., and Jiang, X.A data-centric approach to generate faithful and high quality patient summaries with large language models.arXiv preprint arXiv:2402.15422, 2024.
  • Hollmann etal. (2022)Hollmann, N., Müller, S., Eggensperger, K., and Hutter, F.Tabpfn: A transformer that solves small tabular classification problems in a second.arXiv preprint arXiv:2207.01848, 2022.
  • Huang etal. (2020)Huang, X., Khetan, A., Cvitkovic, M., and Karnin, Z.Tabtransformer: Tabular data modeling using contextual embeddings.arXiv preprint arXiv:2012.06678, 2020.
  • Imani etal. (2023)Imani, S., Du, L., and Shrivastava, H.Mathprompter: Mathematical reasoning using large language models.arXiv preprint arXiv:2303.05398, 2023.
  • Jaitly etal. (2023)Jaitly, S., Shah, T., Shugani, A., and Grewal, R.S.Towards better serialization of tabular data for few-shot classification with large language models, 2023.
  • Ji etal. (2023)Ji, Z., Yu, T., Xu, Y., Lee, N., Ishii, E., and Fung, P.Towards mitigating llm hallucination via self reflection.In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 1827–1843, 2023.
  • Kadra etal. (2021)Kadra, A., Lindauer, M., Hutter, F., and Grabocka, J.Well-tuned simple nets excel on tabular datasets.Advances in neural information processing systems, 34:23928–23941, 2021.
  • Kan (2015)Kan, W.San francisco crime classification, 2015.URL https://kaggle.com/competitions/sf-crime.
  • Kaplan etal. (2020)Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D.Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020.
  • Ke etal. (2017)Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y.Lightgbm: A highly efficient gradient boosting decision tree.Advances in neural information processing systems, 30, 2017.
  • Kim etal. (2024)Kim, Y., Xu, X., McDuff, D., Breazeal, C., and Park, H.W.Health-llm: Large language models for health prediction via wearable sensor data.arXiv preprint arXiv:2401.06866, 2024.
  • Kojima etal. (2022)Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., and Iwasawa, Y.Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022.
  • Lan etal. (2019)Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R.Albert: A lite bert for self-supervised learning of language representations.arXiv preprint arXiv:1909.11942, 2019.
  • Lee & Lindsey (2024)Lee, S.A. and Lindsey, T.Do large language models understand medical codes?arXiv preprint arXiv:2403.10822, 2024.
  • Lee etal. (2024a)Lee, S.A., Brokowski, T., and Chiang, J.N.Enhancing antibiotic stewardship using a natural language approach for better feature representation.arXiv preprint arXiv:2405.20419, 2024a.
  • Lee etal. (2024b)Lee, S.A., Jain, S., Chen, A., Biswas, A., Fang, J., Rudas, A., and Chiang, J.N.Multimodal clinical pseudo-notes for emergency department prediction tasks using multiple embedding model for ehr (meme).arXiv preprint arXiv:2402.00160, 2024b.
  • Lee etal. (2024c)Lee, S.A., Jain, S., Chen, A., Ono, K., Fang, J., Rudas, A., and Chiang, J.N.Emergency department decision support using clinical pseudo-notes, 2024c.
  • Levin etal. (2022)Levin, R., Cherepanova, V., Schwarzschild, A., Bansal, A., Bruss, C.B., Goldstein, T., Wilson, A.G., and Goldblum, M.Transfer learning with deep tabular models.arXiv preprint arXiv:2206.15306, 2022.
  • Lewis etal. (2019)Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L.Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.arXiv preprint arXiv:1910.13461, 2019.
  • Li etal. (2024)Li, J., Hui, B., Qu, G., Yang, J., Li, B., Li, B., Wang, B., Qin, B., Geng, R., Huo, N., etal.Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls.Advances in Neural Information Processing Systems, 36, 2024.
  • Li etal. (2023)Li, Z., Zhang, X., Zhang, Y., Long, D., Xie, P., and Zhang, M.Towards general text embeddings with multi-stage contrastive learning.arXiv preprint arXiv:2308.03281, 2023.
  • Liu etal. (2022)Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., and Raffel, C.A.Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning.Advances in Neural Information Processing Systems, 35:1950–1965, 2022.
  • Liu etal. (2019)Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V.Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692, 2019.
  • Mayer & Jacobsen (2020)Mayer, R. and Jacobsen, H.-A.Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools.ACM Computing Surveys (CSUR), 53(1):1–37, 2020.
  • Min etal. (2024)Min, D., Hu, N., Jin, R., Lin, N., Chen, J., Chen, Y., Li, Y., Qi, G., Li, Y., Li, N., etal.Exploring the impact of table-to-text methods on augmenting llm-based question answering with domain hybrid data.arXiv preprint arXiv:2402.12869, 2024.
  • Muennighoff etal. (2022)Muennighoff, N., Tazi, N., Magne, L., and Reimers, N.Mteb: Massive text embedding benchmark.arXiv preprint arXiv:2210.07316, 2022.
  • Niu etal. (2020)Niu, S., Liu, Y., Wang, J., and Song, H.A decade survey of transfer learning (2010–2020).IEEE Transactions on Artificial Intelligence, 1(2):151–166, 2020.
  • Ojha & Nicosia (2020)Ojha, V. and Nicosia, G.Multi-objective optimisation of multi-output neural trees.In 2020 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8. IEEE, 2020.
  • Pan & Yang (2009)Pan, S.J. and Yang, Q.A survey on transfer learning.IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009.
  • Perez etal. (2021)Perez, E., Kiela, D., and Cho, K.True few-shot learning with language models.Advances in neural information processing systems, 34:11054–11070, 2021.
  • Popov etal. (2019)Popov, S., Morozov, S., and Babenko, A.Neural oblivious decision ensembles for deep learning on tabular data.arXiv preprint arXiv:1909.06312, 2019.
  • Radford etal. (2018)Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., etal.Improving language understanding by generative pre-training.2018.
  • Radford etal. (2019)Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., etal.Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019.
  • Radford etal. (2021)Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., etal.Learning transferable visual models from natural language supervision.In International conference on machine learning, pp. 8748–8763. PMLR, 2021.
  • Rouhani etal. (2018)Rouhani, B.D., Riazi, M.S., and Koushanfar, F.Deepsecure: Scalable provably-secure deep learning.In Proceedings of the 55th annual design automation conference, pp. 1–6, 2018.
  • Sahakyan etal. (2021)Sahakyan, M., Aung, Z., and Rahwan, T.Explainable artificial intelligence for tabular data: A survey.IEEE access, 9:135392–135422, 2021.
  • Sanh etal. (2019)Sanh, V., Debut, L., Chaumond, J., and Wolf, T.Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter.arXiv preprint arXiv:1910.01108, 2019.
  • Sanh etal. (2021)Sanh, V., Webson, A., Raffel, C., Bach, S.H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T.L., Raja, A., etal.Multitask prompted training enables zero-shot task generalization.arXiv preprint arXiv:2110.08207, 2021.
  • Sarkar (2022)Sarkar, T.Xbnet: An extremely boosted neural network.Intelligent Systems with Applications, 15:200097, 2022.
  • Shwartz-Ziv & Armon (2022)Shwartz-Ziv, R. and Armon, A.Tabular data: Deep learning is not all you need.Information Fusion, 81:84–90, 2022.
  • Smith etal. (1988)Smith, J.W., Everhart, J.E., Dickson, W., Knowler, W.C., and Johannes, R.S.Using the adap learning algorithm to forecast the onset of diabetes mellitus.In Proceedings of the annual symposium on computer application in medical care, pp. 261. American Medical Informatics Association, 1988.
  • Somepalli etal. (2021)Somepalli, G., Goldblum, M., Schwarzschild, A., Bruss, C.B., and Goldstein, T.Saint: Improved neural networks for tabular data via row attention and contrastive pre-training.arXiv preprint arXiv:2106.01342, 2021.
  • Song etal. (2023)Song, Y., Xiong, W., Zhu, D., Li, C., Wang, K., Tian, Y., and Li, S.Restgpt: Connecting large language models with real-world applications via restful apis.arXiv preprint arXiv:2306.06624, 2023.
  • St etal. (1989)St, L., Wold, S., etal.Analysis of variance (anova).Chemometrics and intelligent laboratory systems, 6(4):259–272, 1989.
  • Su etal. (2019)Su, D., Xu, Y., Winata, G.I., Xu, P., Kim, H., Liu, Z., and Fung, P.Generalizing question answering system with pre-trained language model fine-tuning.In Proceedings of the 2nd workshop on machine reading for question answering, pp. 203–211, 2019.
  • Sui etal. (2024)Sui, Y., Zhou, M., Zhou, M., Han, S., and Zhang, D.Table meets llm: Can large language models understand structured table data? a benchmark and empirical study.In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pp. 645–654, 2024.
  • Torrey & Shavlik (2010)Torrey, L. and Shavlik, J.Transfer learning.In Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, pp. 242–264. IGI global, 2010.
  • Trinh etal. (2024)Trinh, T.H., Wu, Y., Le, Q.V., He, H., and Luong, T.Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024.
  • Vaswani etal. (2017)Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., and Polosukhin, I.Attention is all you need.Advances in neural information processing systems, 30, 2017.
  • Wang etal. (2023)Wang, K., Ren, H., Zhou, A., Lu, Z., Luo, S., Shi, W., Zhang, R., Song, L., Zhan, M., and Li, H.Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning.arXiv preprint arXiv:2310.03731, 2023.
  • Webson & Pavlick (2021)Webson, A. and Pavlick, E.Do prompt-based models really understand the meaning of their prompts?arXiv preprint arXiv:2109.01247, 2021.
  • Wei etal. (2021)Wei, J., Bosma, M., Zhao, V.Y., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M., and Le, Q.V.Finetuned language models are zero-shot learners.arXiv preprint arXiv:2109.01652, 2021.
  • Weiss etal. (2016)Weiss, K., Khoshgoftaar, T.M., and Wang, D.A survey of transfer learning.Journal of Big data, 3:1–40, 2016.
  • Wolf etal. (2019)Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., etal.Huggingface’s transformers: State-of-the-art natural language processing.arXiv preprint arXiv:1910.03771, 2019.
  • Xu etal. (2023)Xu, H., Pang, G., Wang, Y., and Wang, Y.Deep isolation forest for anomaly detection.IEEE Transactions on Knowledge and Data Engineering, 2023.
  • Yang etal. (2024)Yang, Y., Mishra, S., Chiang, J.N., and Mirzasoleiman, B.Smalltolarge (s2l): Scalable data selection for fine-tuning large language models by summarizing training trajectories of small models.arXiv preprint arXiv:2403.07384, 2024.
  • Yang etal. (2019)Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., and Le, Q.V.Xlnet: Generalized autoregressive pretraining for language understanding.Advances in neural information processing systems, 32, 2019.
  • Yao etal. (2023)Yao, L., Zhang, Y., Yan, Z., and Tian, J.Sai: Solving ai tasks with systematic artificial intelligence in communication network.arXiv preprint arXiv:2310.09049, 2023.
  • Yin etal. (2020)Yin, P., Neubig, G., Yih, W.-t., and Riedel, S.Tabert: Pretraining for joint understanding of textual and tabular data.arXiv preprint arXiv:2005.08314, 2020.
  • Zhang etal. (2023)Zhang, H., Wen, X., Zheng, S., Xu, W., and Bian, J.Towards foundation models for learning on tabular data.arXiv preprint arXiv:2310.07338, 2023.
  • Zhang etal. (2018)Zhang, P., Liu, S., Chaurasia, A., Ma, D., Mlodzianoski, M.J., Culurciello, E., and Huang, F.Analyzing complex single-molecule emission patterns with deep learning.Nature methods, 15(11):913–916, 2018.
  • Zhao etal. (2021)Zhao, Z., Wallace, E., Feng, S., Klein, D., and Singh, S.Calibrate before use: Improving few-shot performance of language models.In International conference on machine learning, pp. 12697–12706. PMLR, 2021.
  • Zhong etal. (2021)Zhong, R., Lee, K., Zhang, Z., and Klein, D.Adapting language models for zero-shot learning by meta-tuning on dataset and prompt collections.arXiv preprint arXiv:2104.04670, 2021.
  • Zhuang etal. (2020)Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., and He, Q.A comprehensive survey on transfer learning.Proceedings of the IEEE, 109(1):43–76, 2020.

Appendix

In the appendix we cover the following sections:

  • Section A: Supplementary Section

  • Section B: Datasets

  • Section C: Foundation Models Table

  • Section D: Hyperparameters of ML Models

  • Section E: Feature Selection Methods

  • Section F: Metrics

Appendix A Supplementary Section

Limitations

One notable limitation of language models in tabular tasks is that they are computationally demanding and costly in terms of runtime compared to methods such as SVMs and gradient boosting. A graphic illustrating runtime at inference is shown in Figure 5. Another limitation of language model technologies is the accessibility barrier their computational demands create. We acknowledge that not all groups have access to GPU hardware, which represents a significant barrier in this field of work. In this study, we therefore elected to use small LMs over recent large language models (LLMs) because they can run as a local instance without advanced hardware, which also aids the reproducibility of this work.

[Figure 5: Runtime at inference for language models compared with conventional methods.]

Another notable limitation of this work was highlighted in Section 6.2. Specifically, serialized sentences appear to heavily influence the raw prediction probabilities. While the premise of TabLLM (Hegselmann et al., 2023) explored this issue, our research, combined with theirs, still leaves lingering questions about the appropriate strategy for text serialization.

Future Work:

Future work should focus on exploring the scalability and performance of these models as the number of parameters increases (Kaplan et al., 2020). As LLMs grow in popularity and size, they demonstrate enhanced capabilities and can produce SOTA performance, but this growth also introduces challenges related to access to computational resources and model optimization.

Appendix B Datasets

B.1 Baseline Datasets

Iris Dataset

The Iris dataset (Fisher, 1988) is a classic dataset in the field of machine learning and statistics, often used for benchmarking classification algorithms. It consists of 150 samples divided equally among three species of Iris flowers: Iris setosa, Iris versicolor, and Iris virginica. Each sample in the dataset is described by four features: sepal length, sepal width, petal length, and petal width, all measured in centimeters. These features are used to predict the species of the Iris flower, making it a multiclass classification problem. The dataset is well-balanced, with 50 samples from each species, providing a clear example for exploring and demonstrating the capabilities of various classification techniques, from simple linear models to more complex, nonlinear classifiers.

sepal length (cm)   sepal width (cm)   petal length (cm)   petal width (cm)   label
5.1                 3.5                1.4                 0.2                0
4.9                 3.0                1.4                 0.2                0
4.7                 3.2                1.3                 0.2                0


Serialized Text:
The Iris has sepal Length is 5.1 centimeters. Sepal width is 3.5 centimeters. Petal length is 1.4 centimeters. Petal width is 0.2 centimeters.
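In practice, such a serialization is just a string template over the row's feature values. Below is a minimal sketch of this idea; the function name serialize_iris and the dictionary keys are illustrative, not our released code.

    # Minimal sketch: template-based serialization of one Iris row.
    # The template mirrors the boxed example above.
    def serialize_iris(row: dict) -> str:
        return (
            f"The Iris has sepal Length is {row['sepal length (cm)']} centimeters. "
            f"Sepal width is {row['sepal width (cm)']} centimeters. "
            f"Petal length is {row['petal length (cm)']} centimeters. "
            f"Petal width is {row['petal width (cm)']} centimeters."
        )

    row = {"sepal length (cm)": 5.1, "sepal width (cm)": 3.5,
           "petal length (cm)": 1.4, "petal width (cm)": 0.2}
    print(serialize_iris(row))  # reproduces the sentence above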

Wine Dataset

The Wine (Aeberhard & Forina, 1991) dataset is a well-regarded dataset in the machine learning community, commonly used to evaluate multiclass classification algorithms. It comprises 178 instances from three different types of Italian wine: Barolo, Grignolino, and Barbera, derived from the Piedmont region. The dataset is characterized by thirteen attributes, including alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline. These attributes are chemically significant and contribute to differentiating one type of wine from another. The objective is to classify each wine into one of the three categories based on its chemical makeup, making it a typical example of a multiclass classification problem.

alcohol   malic_acid   ash   alcalinity_of_ash   magnesium   total_phenols   flavanoids
14.2      1.7          2.4   15.6                127.0       2.8             3.1
13.2      1.8          2.1   11.2                100.0       2.6             2.8
13.2      2.4          2.7   18.6                101.0       2.8             3.2

nonflavanoid_phenols   proanthocyanins   color_intensity   hue   od280/od315_of_diluted_wines   proline   label
0.3                    2.3               5.6               1.0   3.9                            1065.0    0
0.3                    1.3               4.4               1.1   3.4                            1050.0    0
0.3                    2.8               5.7               1.0   3.2                            1185.0    0


Serialized Text:
My wine has an Alcohol percentage of 14.2%. The Malic Acid is 1.7 grams per liter. Ash is 2.4 grams per liter. Alcalinity of ash is 15.6 pH. Magnesium is 127 milligrams per liter. Total Phenols is 2.8 milligrams per liter. Flavanoids is 3.1 milligrams per liter. Nonflavanoid phenols is 0.3 milligrams per liter. Proanthocyanins is 2.3 milligrams per liter. Color intensity is 5.6. Hue is 1.0. OD280/OD315 of diluted wines is 3.9. Proline is 1065.

Feature                        Score
Alcohol                        99.18
Malic Acid                     33.47
Ash                            11.16
Alkalinity of Ash              28.68
Magnesium                      5.52
Total Phenols                  78.24
Flavanoids                     272.00
Nonflavanoid Phenols           26.65
Proanthocyanins                25.28
Color Intensity                101.34
Hue                            85.70
OD280/OD315 of Diluted Wines   175.80
Proline                        151.48


Feature Selected Serialized Text:
My wine has an Alcohol percentage of 14.2%. The Malic Acid is 1.7 grams per liter. Ash is 2.4 grams per liter. Total Phenols is 2.8 milligrams per liter. Flavanoids is 3.1 milligrams per liter. Color intensity is 5.6. Hue is 1.0. OD280/OD315 of diluted wines is 3.9. Proline is 1065.

Diabetes Dataset

The Diabetes dataset (Smith et al., 1988), often referred to as the Pima Indians Diabetes Database, is a frequently used dataset in the domain of medical informatics for predicting the onset of diabetes based on diagnostic measures. This dataset consists of 768 instances, each representing a female at least 21 years old of Pima Indian heritage. The dataset encompasses several medical predictor variables, including the number of pregnancies, plasma glucose concentration, diastolic blood pressure, triceps skinfold thickness, 2-hour serum insulin, body mass index, diabetes pedigree function, and age. The target variable indicates whether the individual was diagnosed with diabetes (1) or not (0), making it a binary classification problem. This dataset is pivotal in the development and testing of predictive models aimed at diagnosing diabetes early and has been instrumental in numerous studies related to machine learning in healthcare.

Pregnancies   Glucose   BloodPressure   SkinThickness   Insulin   BMI    DiabetesPedigreeFunction   Age   Outcome
6             148       72              35              0         33.6   0.6                        50    1
1             85        66              29              0         26.6   0.4                        31    0
8             183       64              0               0         23.3   0.7                        32    1


Serialized Text:
The Age is 50. The Number of times pregnant is 6. The Diastolic blood pressure is 72. The Triceps skin fold thickness is 32. The Plasma glucose concentration at 2 hours in an oral glucose tolerance test (GTT) is 148. The 2-hour serum insulin is 0. The Body mass index is 33.6. The Diabetes pedigree function is 0.6.

Feature                      Importance Score
Pregnancies                  23.93
Glucose                      163.60
Blood Pressure               2.04
Skin Thickness               4.80
Insulin                      8.92
BMI                          62.25
Diabetes Pedigree Function   16.77
Age                          37.07


Feature Selected Serialized Text:
The Age is 50. The Number of times pregnant is 6. The Plasma glucose concentration at 2 hours in an oral glucose tolerance test (GTT) is 148. The Body mass index is 33.6. The Diabetes pedigree function is 0.6.

Titanic Dataset

The Titanic dataset (Eaton & Haas, 1995) is one of the most iconic datasets used in the realm of data science, especially for beginners practicing classification techniques. It comprises passenger records from the tragic maiden voyage of the RMS Titanic in 1912. This dataset typically includes 891 instances, representing a subset of the total passenger list. Each instance includes various attributes such as passenger class (Pclass), name, sex, age, number of siblings/spouses aboard (SibSp), number of parents/children aboard (Parch), ticket number, fare, cabin number, and port of embarkation. The primary objective with this dataset is to predict a passenger’s survival (1 for survived, 0 for did not survive), making it a binary classification problem. The Titanic dataset not only challenges model builders to predict survival outcomes accurately but also provides an opportunity to explore data preprocessing techniques like handling missing values, feature engineering, and categorical data encoding. It serves as a practical introduction to machine learning tasks and is frequently used in educational settings to demonstrate the steps involved in the data science workflow from preprocessing to model evaluation.

PassengerId   Survived   Pclass   Name                                                Sex      Age    SibSp   Parch
1             0          3        Braund, Mr. Owen Harris                             male     22.0   1       0
2             1          1        Cumings, Mrs. John Bradley (Florence Briggs Tha…    female   38.0   1       0
3             1          3        Heikkinen, Miss. Laina                              female   26.0   0       0

Ticket             Fare   Cabin   Embarked
A/5 21171          7.2    NaN     S
PC 17599           71.3   C85     C
STON/O2. 3101282   7.9    NaN     S


Serialized Text:
Passenger Name is Mr. Owen Harris Braund. Passenger is 22-years-old. Passenger is male. They paid $7.2. They are in 3rd-class ticket. They embarked from Southampton. They are with 1 sibling(s)/spouse(s). They are with 0 parent(s)/children. They are staying in cabin Unknown.


Modified Serialized Text (SOTA):
Passenger Mr. Owen Harris Braund, a 22-year-old male, paid $7.2 for a 3rd-class ticket and embarked from Southampton. They were accompanied by 1 sibling(s)/spouse(s) and 0 parent(s)/children, and they were aboard in cabin Unknown.

[Figure 6]


Feature Selected Serialized Text:
Passenger Mr. Owen Harris Braund, a 22-year-old male, paid $7.2 for a 3rd-class ticket. They were accompanied by 1 sibling(s)/spouse(s) and 0 parent(s)/children.

B.2 Experimental Datasets

Home Equity Line of Credit (HELOC) Dataset

The Home Equity Line of Credit (HELOC) dataset is a rich resource for data scientists and machine learning practitioners focusing on financial decision-making processes. This dataset, sourced from real loan applications, includes data from applicants who applied for a home equity line of credit from a lending institution. It features approximately 10,459 instances, each characterized by a series of attributes that are critical in assessing creditworthiness and risk. These attributes include borrower’s credit score, loan to value ratio, number of derogatory remarks, total credit balance, and more, comprising a total of 23 predictive attributes plus a binary target variable. The target variable indicates whether the applicant was approved (1) or rejected (0) for the loan, setting up a binary classification problem. The HELOC dataset not only tests a model’s ability to predict loan approval based on complex interactions between various financial indicators but also pushes the boundaries of responsible AI by emphasizing the need for fair and unbiased decision-making systems in finance. It serves as an excellent basis for developing and refining models that deal with imbalanced data, process personal financial information, and require careful feature engineering and selection to predict outcomes accurately.

RiskPerformance  ExternalRiskEstimate  MSinceOldestTradeOpen  MSinceMostRecentTradeOpen  AverageMInFile  NumSatisfactoryTrades  NumTrades60Ever2DerogPubRec  NumTrades90Ever2DerogPubRec  PercentTradesNeverDelq  MSinceMostRecentDelq
Bad              55                    144                    4                          84              20                     3                            0                            83                      2
Bad              61                    58                     15                         41              2                      4                            4                            100                     -7
Bad              67                    66                     5                          24              9                      0                            0                            100                     -7

MaxDelq2PublicRecLast12M  MaxDelqEver  NumTotalTrades  NumTradesOpeninLast12M  PercentInstallTrades  MSinceMostRecentInqexcl7days  NumInqLast6M  NumInqLast6Mexcl7days  NetFractionRevolvingBurden  NetFractionInstallBurden
3                         5            23              1                       43                    0                             0             0                      33                          -8
0                         8            7               0                       67                    0                             0             0                      0                           -8
7                         8            9               4                       44                    0                             4             4                      53                          66


Serialized Text:
External Risk Estimate is 55. Months Since Oldest Trade Open is 144. Months Since Most Recent Trade Open is 4. Average Months In File is 84. Number of Satisfactory Trades is 20. Number of Trades 60 Ever 2 Derogatory/Public Records is 3. Number of Trades 90 Ever 2 Derogatory/Public Records is 0. Percent Trades Never Delinquent is 83. Months Since Most Recent Delinquency is 2. Max Delinquency 2 Public Record Last 12 Months is 3. Maximum Delinquency Ever is 5. Number of Total Trades is 23. Number of Trades Open in Last 12 Months is 1. Percent Installment Trades is 43. Months Since Most Recent Inquiry Excluding Last 7 Days is 0. Number of Inquiries Last 6 Months is 0. Number of Inquiries Last 6 Months Excluding Last 7 Days is 0. Net Fraction Revolving Burden is 33. Net Fraction Installment Burden is -8. Number of Revolving Trades with Balance is 8. Number of Installment Trades with Balance is 1. Number of Bank/National Trades with High Utilization is 1. Percent of Trades with Balance is 69.

Feature                                                Importance Score
External Risk Estimate                                 390.94
Months Since Oldest Trade Open                         282.23
Months Since Most Recent Trade Open                    14.51
Average Months In File                                 371.41
Number of Satisfactory Trades                          113.51
Number of Trades 60 Ever 2 Derog/Public Rec            45.44
Number of Trades 90 Ever 2 Derog/Public Rec            20.50
Percent Trades Never Delinquent                        116.84
Months Since Most Recent Delinquency                   33.35
Max Delinquency 2 Public Rec Last 12 Months            98.07
Max Delinquency Ever                                   96.19
Number of Total Trades                                 64.18
Number of Trades Open in Last 12 Months                10.90
Percent Installment Trades                             116.30
Months Since Most Recent Inquiry excl 7 days           103.23
Number of Inquiries Last 6 Months                      65.35
Number of Inquiries Last 6 Months excl 7 days          58.71
Net Fraction Revolving Burden                          811.45
Net Fraction Installment Burden                        67.57
Number of Revolving Trades with Balance                19.75
Number of Installment Trades with Balance              13.88
Number of Bank/National Trades with High Utilization   6.33
Percent of Trades with Balance                         337.51


Feature Selected Serialized Text:
External Risk Estimate is 55. Months Since Oldest Trade Open is 144. Average Months In File is 84. Number of Satisfactory Trades is 20. Percent Trades Never Delinquent is 83. Max Delinquency 2 Public Record Last 12 Months is 3. Maximum Delinquency Ever is 5. Number of Total Trades is 23. Percent Installment Trades is 43. Months Since Most Recent Inquiry Excluding Last 7 Days is 0. Number of Inquiries Last 6 Months is 0. Number of Inquiries Last 6 Months Excluding Last 7 Days is 0. Net Fraction Revolving Burden is 33. Net Fraction Installment Burden is -8. Percent of Trades with Balance is 69.

Credit Card Fraud Dataset

The Credit Card Fraud dataset (Dal Pozzolo et al., 2014, 2015, 2017), available on Kaggle, is a critical dataset in the financial sector for the development and testing of anomaly detection systems. This dataset contains transactions made by credit cards in September 2013 by European cardholders. It consists of 284,807 transactions, where each transaction is represented by 31 features. These features include 28 numerical input variables (V1 to V28), which are the result of a Principal Component Analysis (PCA) transformation to protect sensitive information, the transaction amount (Amount), and the time since the first transaction in the dataset (Time). The target variable is binary, indicating fraud (1) or not fraud (0), making it a binary classification problem. The dataset is highly imbalanced, with fraud transactions making up only 0.172% of all transactions. This dataset challenges researchers to effectively detect fraudulent transactions in a highly imbalanced data setting, which is crucial for preventing financial losses due to fraud, and is extensively used in machine learning research focused on fraud detection.
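Given this extreme imbalance, any train/test split should preserve the 0.172% fraud rate in both partitions. The sketch below uses scikit-learn's stratified splitting; the file name creditcard.csv is the dataset's usual Kaggle name and is assumed here.

    # Sketch: a stratified split that preserves the ~0.172% fraud rate.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("creditcard.csv")            # 284,807 rows, 31 columns
    X, y = df.drop(columns=["Class"]), df["Class"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    print(y_train.mean(), y_test.mean())          # both close to 0.00172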

Time   V1     V2     V3     V4     V5     V6     V7     V8     V9     V10    V11    V12    V13    V14    V15
0.0    -1.4   -0.1   2.5    1.4    -0.3   0.5    0.2    0.1    0.4    0.1    -0.6   -0.6   -1.0   -0.3   1.5
0.0    1.2    0.3    0.2    0.4    0.1    -0.1   -0.1   0.1    -0.3   -0.2   1.6    1.1    0.5    -0.1   0.6
1.0    -1.4   -1.3   1.8    0.4    -0.5   1.8    0.8    0.2    -1.5   0.2    0.6    0.1    0.7    -0.2   2.3

V16    V17    V18    V19    V20    V21    V22    V23    V24    V25    V26    V27    V28    Amount   Class
-0.5   0.2    0.0    0.4    0.3    -0.0   0.3    -0.1   0.1    0.1    -0.2   0.1    -0.0   149.6    0
0.5    -0.1   -0.2   -0.1   -0.1   -0.2   -0.6   0.1    -0.3   0.2    0.1    -0.0   0.0    2.7      0
-2.9   1.1    -0.1   -2.3   0.5    0.2    0.8    0.9    -0.7   -0.3   -0.1   -0.1   -0.1   378.7    0


Serialized Transaction Data:
V1 is -1.4. V2 is -0.1. V3 is 2.5. V4 is 1.4. V5 is -0.3. V6 is 0.5. V7 is 0.2. V8 is 0.1. V9 is 0.4. V10 is 0.1. V11 is -0.6. V12 is -0.6. V13 is -1.0. V14 is -0.3. V15 is 1.5. V16 is -0.5. V17 is 0.2. V18 is 0.0. V19 is 0.4. V20 is 0.3. V21 is -0.0. V22 is 0.3. V23 is -0.1. V24 is 0.1. V25 is 0.1. V26 is -0.2. V27 is 0.1. V28 is -0.0.

Feature   Importance Score
V1        2527.72
V2        1998.44
V3        9026.38
V4        4002.88
V5        2345.90
V6        428.86
V7        8861.27
V8        87.15
V9        2133.98
V10       10886.90
V11       5309.16
V12       15834.84
V13       4.13
V14       21806.04
V15       4.06
V16       8917.15
V17       27131.19
V18       2917.22
V19       270.12
V20       93.85
V21       478.77
V22       1.30
V23       1.10
V24       8.64
V25       3.87
V26       4.44
V27       15.92
V28       37.68
Amount    8.72


Feature Selected Serialized Transaction Data:
V1 is -1.4. V2 is -0.1. V3 is 2.5. V4 is 1.4. V5 is -0.3. V6 is 0.5. V7 is 0.2. V9 is 0.4. V10 is 0.1. V11 is -0.6. V12 is -0.6. V14 is -0.3. V16 is -0.5. V17 is 0.2. V18 is 0.0. V19 is 0.4. V20 is 0.3. V21 is -0.0.

San Francisco Crime Dataset

The San Francisco Crime dataset (Kan, 2015), available on Kaggle, is an extensive dataset widely used in the domain of predictive modeling and public safety analytics. It includes incidents derived from the San Francisco Police Department's crime incident reporting system, spanning 12 years from 2003 to 2015. The dataset contains 878,049 instances, each documented with several attributes such as the date, police department district, the category of the crime, the description of the incident, day of the week, and geographical coordinates (latitude and longitude).

The primary objective with this dataset is to predict the category of crime that occurred, making it a multiclass classification problem. Each record is classified into one of 39 distinct crime categories, which include varying offenses from larceny/theft, non-criminal, assault, to drug/narcotic violations. This dataset challenges data scientists to analyze and predict crime patterns based on temporal and spatial features, which is crucial for law enforcement agencies to allocate resources effectively and improve public safety. The San Francisco Crime dataset not only serves as a critical resource for training machine learning models to understand urban crime dynamics but also provides insights into the effectiveness of different policing strategies over time.

Dates                 Category         Descript                   DayOfWeek   PdDistrict   Resolution       Address                     X          Y
2015-05-13 23:53:00   WARRANTS         WARRANT ARREST             Wednesday   NORTHERN     ARREST, BOOKED   OAK ST / LAGUNA ST          -122.425   37.774
2015-05-13 23:53:00   OTHER OFFENSES   TRAFFIC VIOLATION ARREST   Wednesday   NORTHERN     ARREST, BOOKED   OAK ST / LAGUNA ST          -122.425   37.774
2015-05-13 23:33:00   OTHER OFFENSES   TRAFFIC VIOLATION ARREST   Wednesday   NORTHERN     ARREST, BOOKED   VANNESS AV / GREENWICH ST   -122.424   37.800


Serialized Sentence:
The description of the incident was WARRANT ARREST. The crime happened on Wednesday in the NORTHERN police district. The incident happened at OAK ST / LAGUNA ST, with coordinates (-122.4, 37.8).

Gene Expression Profiles for Cancer Target Identification Dataset

The Gene Expression Profiles dataset (Fiorini, 2016) is a vital resource in the burgeoning field of machine learning for drug discovery, specifically in identifying targets for cancer therapies. This dataset consists of gene expression profiles derived from various cancer patients. It includes data from multiple studies focused on different types of cancer, where each sample is described by potentially thousands of gene expression features, reflecting the activity levels of various genes in the tissues sampled from cancer patients.

The primary objective with this dataset is to distinguish between different cancer types or to predict the response of various cancers to treatments, making it an essential tool for multiclass classification or regression problems in biomedical research. The complexity of the dataset, due to the high dimensionality of the feature space and the biological variability among samples, poses significant challenges in model building, feature selection, and interpretation of results.

gene_0   gene_1   gene_2   gene_3   gene_4   gene_5   gene_6   gene_7   gene_8   gene_9
0.0      2.0      3.3      5.5      10.4     0.0      7.2      0.6      0.0      0.0
0.0      0.6      1.6      7.6      9.6      0.0      6.8      0.0      0.0      0.0
0.0      3.5      4.3      6.9      9.9      0.0      7.0      0.5      0.0      0.0

gene_10   gene_11   gene_12   gene_13   gene_14   gene_15   gene_16   gene_17   ...   gene_20000
0.6       1.3       2.0       0.6       0.0       0.0       0.0       0.0       ...   0.4
0.0       0.6       2.5       1.0       0.0       0.0       0.0       0.0       ...   0.0
0.0       0.5       2.0       1.1       0.0       0.0       0.0       0.0       ...   1.3


Serialized Text:
Gene 0 is 0.0. Gene 1 is 0.6. Gene 2 is 1.6. Gene 3 is 7.6. Gene 4 is 9.6. Gene 5 is 0.0. Gene 6 is 6.8. Gene 7 is 0.0. Gene 8 is 0.0. Gene 9 is 0.0.
Gene 10 is 0.0. Gene 11 is 0.6. Gene 12 is 2.5. Gene 13 is 1.0. Gene 14 is 0.0. Gene 15 is 0.0. Gene 16 is 0.0. Gene 17 is 0.0. Gene 18 is 0.0. Gene 19 is 11.1.
Gene 20 is 3.6. Gene 21 is 0.0. Gene 22 is 10.1. Gene 23 is 0.0. Gene 24 is 0.0. Gene 25 is 0.0. Gene 26 is 9.9. Gene 27 is 8.5. Gene 28 is 1.2. Gene 29 is 4.9.
Gene 30 is 0.0. Gene 31 is 0.0. Gene 32 is 5.8. Gene 33 is 1.3. Gene 34 is 13.3. Gene 35 is 6.7. Gene 36 is 0.6. Gene 37 is 0.0. Gene 38 is 9.5. Gene 39 is 0.8.
Gene 40 is 9.7. Gene 41 is 0.0. Gene 42 is 0.3. Gene 43 is 0.0. Gene 44 is 2.7. Gene 45 is 6.7. Gene 46 is 9.8. Gene 47 is 8.8. Gene 48 is 11.5...

[Figure 7]

B.3 Feature Scaling Experiments

We applied various feature scaling techniques to correct the skewness of the Titanic dataset. Below we display examples of serialized sentences with the applied transforms.

[Figure 8]

[Figure 9]

Standardization

z = \frac{x - \mu}{\sigma}    (1)

These symbols denote the following: z represents the standardized value, x stands for the original value, \mu denotes the mean of the data, and \sigma signifies the standard deviation of the data.


Standardized Selected Serialized Text:
Passenger Mr. Owen Harris Braund, a -0.565-year-old male, paid $-0.502 for a 3rd-class ticket. They were accompanied by 1 sibling(s)/spouse(s) and 0 parent(s)/children.

Normalization

x_{\text{norm}} = \frac{x}{\max(x)}    (2)

In this context, x_{\text{norm}} denotes the normalized value of x, where x stands for the original value and \max(x) represents the maximum value in the dataset.


Normalized Serialized Text:
Passenger Mr. Owen Harris Braund, a 0.271-year-old male, paid $0.014 for a 3rd-class ticket. They were accompanied by 1 sibling(s)/spouse(s) and 0 parent(s)/children.

Log Transformation

y = \log(x)    (3)

In this context, y represents the logarithmically transformed value of x, where x stands for the original value.


Log-Transformed Serialized Text:
Passenger Mr. Owen Harris Braund, a 3.135-year-old male, paid $2.110 for a 3rd-class ticket. They were accompanied by 1 sibling(s)/spouse(s) and 0 parent(s)/children.
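As a rough sketch, all three transforms above can be applied column-wise with pandas and NumPy before serialization; the toy DataFrame and column choice below are illustrative rather than our exact preprocessing pipeline.

    # Sketch: applying transforms (1)-(3) to numeric Titanic columns.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Fare": [7.2, 71.3, 7.9]})

    standardized = (df - df.mean()) / df.std()    # Eq. (1), sample std
    normalized = df / df.max()                    # Eq. (2), per column
    log_scaled = np.log(df)                       # Eq. (3), values must be > 0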

Appendix C Foundation Models Table

BERT (Devlin et al., 2018): Originally pretrained on a corpus consisting of Wikipedia and BookCorpus using masked language modeling (MLM) and next sentence prediction (NSP) tasks to generate bidirectional context representations.

DistilBERT (Sanh et al., 2019): A lighter version of BERT, retaining most of its predecessor's capabilities but with fewer parameters, pretrained using a knowledge distillation process during the MLM task.

RoBERTa (Liu et al., 2019): A variant of BERT optimized through more extensive training on larger data and removal of the NSP task, focusing solely on MLM for better performance.

Electra (Clark et al., 2020): Trained using replaced token detection rather than masked language modeling; Electra discriminates between "real" and "fake" tokens across a corpus, allowing for more efficient learning.

XLNet (Yang et al., 2019): Combines the best of autoregressive and autoencoding techniques, pretrained on a permutation-based language modeling task, which captures bidirectional contexts dynamically.

ALBERT (Lan et al., 2019): A Lite BERT that introduces parameter-reduction techniques to increase training speed and lower memory consumption, focusing on MLM and sentence-order prediction.

DeBERTa (He et al., 2020): Enhances BERT and RoBERTa by incorporating disentangled attention and a new way of encoding positional information, improving on the MLM and NSP tasks.

GPT-2 (Radford et al., 2019): Utilizes a left-to-right autoregressive approach in its pretraining, allowing each token to condition on the previous tokens in a sequence, optimized for a variety of natural language understanding tasks.

Longformer (Beltagy et al., 2020): Designed for longer texts, this model extends the BERT architecture by employing a combination of sliding-window and global attention mechanisms, focusing on efficiency and scalability.

GTE Large (Li et al., 2023): The general text embedding (GTE) model, trained with a multi-stage contrastive learning pre-training objective; it scores highly on the text classification portion of the MTEB benchmark.

GTE Base: Similar to GTE Large but with fewer parameters, focused on achieving comparable performance to larger models while being more computationally efficient.
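Each encoder in this table can be wrapped with a classification head through the Transformers Auto classes. The sketch below assumes the common Hugging Face Hub checkpoint names, which may differ from the exact checkpoints used in our experiments.

    # Sketch: loading any of the tabulated models for sequence classification.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    checkpoint = "bert-base-uncased"  # e.g. "distilbert-base-uncased", "roberta-base"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # A serialized row is classified like any other sentence.
    inputs = tokenizer("Passenger is 22-years-old. Passenger is male.", return_tensors="pt")
    logits = model(**inputs).logits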

C.1 Results of Language Model Evaluation

Model        Loss     Accuracy   Precision   Recall   F1 Score   AUROC    AUPRC    Runtime (s)   Samples/s
BERT         0.4903   0.7821     0.7536      0.7027   0.7273     0.8483   0.8262   5.0933        35.144
DistilBERT   0.4535   0.8045     0.7097      0.8919   0.7904     0.8743   0.8426   2.6072        68.656
RoBERTa      0.5547   0.7989     0.7317      0.8108   0.7692     0.8206   0.7448   4.7434        37.737
Electra      0.4583   0.8268     0.7529      0.8649   0.8050     0.8515   0.7665   5.1101        35.029
XLNet        0.5574   0.7821     0.7536      0.7027   0.7273     0.8529   0.8222   17.336        10.325
ALBERT       0.4802   0.7989     0.7262      0.8243   0.7722     0.8387   0.7637   5.8252        30.729
DeBERTa      0.5057   0.7933     0.7342      0.7838   0.7582     0.8059   0.7006   3.2567        54.964
GPT-2        0.6947   0.6592     0.8824      0.2027   0.3297     0.8408   0.7877   2.0704        86.456
Longformer   0.5092   0.7989     0.7436      0.7838   0.7632     0.8138   0.6742   3.7726        47.447
GTE-large    0.5226   0.7933     0.7761      0.7027   0.7376     0.8704   0.7947   6.4885        27.587
GTE-Base     0.5336   0.7821     0.9070      0.5270   0.6667     0.8725   0.8139   2.1677        82.575
[Figure 10]
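The columns of the table above can be reproduced with a standard metrics callback. The sketch below is a plausible compute_metrics function for the Hugging Face Trainer, assuming binary labels and two-class logits; it is an illustrative reconstruction, not the exact evaluation script used here.

    # Sketch: metrics callback producing the columns of the table above.
    import numpy as np
    from scipy.special import softmax
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score, average_precision_score)

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        probs = softmax(logits, axis=-1)[:, 1]   # P(class = 1)
        preds = np.argmax(logits, axis=-1)
        return {
            "accuracy": accuracy_score(labels, preds),
            "precision": precision_score(labels, preds),
            "recall": recall_score(labels, preds),
            "f1": f1_score(labels, preds),
            "auroc": roc_auc_score(labels, probs),
            "auprc": average_precision_score(labels, probs),
        }

    # Toy check with two examples:
    print(compute_metrics((np.array([[0.1, 2.3], [1.9, -0.2]]), np.array([1, 0]))))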

Appendix D Hyperparameters of Baseline Models

Hyperparameters play a pivotal role in machine learning, governing various aspects of the model training process. In our work, we utilized grid search to systematically explore the optimal settings for different models, adopting the hyperparameter ranges reported by Hegselmann et al. (2023) for the LightGBM and XGBoost models. For XGBoost, we configured parameters such as max_depth ranging from 2 to 12, lambda_l1 and lambda_l2 from 1e-8 to 1.0, and eta from 0.01 to 0.3. For LightGBM, we examined num_leaves from 2 to 4096, lambda_l1 and lambda_l2 extending up to 10.0, and learning_rate from 0.01 to 0.3. The SVM model with an RBF kernel was tested with C values between 0.1 and 100 and gamma values from 0.001 to 1, as well as 'auto' and 'scale'. This comprehensive hyperparameter tuning ensures that the most effective parameter combinations are identified, leading to improved accuracy and robustness; a grid-search sketch follows the parameter listing below.

XGBoost
Model: xgb.XGBClassifier(random_state=42)
Parameters:
  max_depth: [2, 4, 6, 8, 10, 12]
  lambda_l1: [1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
  lambda_l2: [1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
  eta: [0.01, 0.03, 0.1, 0.3]

LightGBM
Model: lgb.LGBMClassifier(random_state=42)
Parameters:
  num_leaves: [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
  lambda_l1: [1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]
  lambda_l2: [1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]
  learning_rate: [0.01, 0.03, 0.1, 0.3]

SVM (RBF)
Model: SVC(probability=True, random_state=42)
Parameters:
  C: [0.1, 1, 10, 100]
  gamma: [0.001, 0.01, 0.1, 1, 'auto', 'scale']
  kernel: ['rbf']
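As referenced above, these grids plug directly into scikit-learn's GridSearchCV. The sketch below wires in the SVM (RBF) grid; the synthetic data stands in for any of the datasets, and the same pattern applies to the XGBoost and LightGBM grids.

    # Sketch: grid search over the SVM (RBF) parameter grid listed above.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    param_grid = {
        "C": [0.1, 1, 10, 100],
        "gamma": [0.001, 0.01, 0.1, 1, "auto", "scale"],
        "kernel": ["rbf"],
    }
    X, y = make_classification(n_samples=200, random_state=42)  # stand-in data
    search = GridSearchCV(SVC(probability=True, random_state=42),
                          param_grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    print(search.best_params_)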

Appendix E Feature Selection Methods

ANOVA F-test

The ANOVA F-test feature selection method works by computing the ANOVA F-value between each feature and the target variable for classification tasks. The ANOVA F-value is a ratio of the between-group variability to the within-group variability, and it measures how well a feature can separate the samples into different classes.

Mathematically, the ANOVA F-value for a feature X and a target variable Y with k classes is calculated as follows:

  1. Calculate the mean of X within each class: \mu_j = \frac{1}{n_j} \sum_{Y_i = j} X_i, where n_j is the number of samples in class j.

  2. Calculate the overall mean of X: \mu = \frac{1}{n} \sum_{i} X_i, where n is the total number of samples.

  3. Calculate the between-group sum of squares (SSB):

     \text{SSB} = \sum_{j} n_j (\mu_j - \mu)^2

  4. Calculate the within-group sum of squares (SSW), summing over all classes j:

     \text{SSW} = \sum_{j} \sum_{Y_i = j} (X_i - \mu_j)^2

  5. Calculate the ANOVA F-value:

     F = \frac{\text{SSB}/(k - 1)}{\text{SSW}/(n - k)}

The higher the F-value, the more discriminative the feature is for separating the classes.
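In practice this statistic need not be computed by hand: scikit-learn's f_classif implements the F-value above, and SelectKBest retains the highest-scoring features. A minimal sketch on the Wine data follows, with k = 9 matching the feature-selected serialization in Appendix B.

    # Sketch: ANOVA F-test feature selection with scikit-learn.
    from sklearn.datasets import load_wine
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_wine(return_X_y=True)                 # 178 x 13, 3 classes
    selector = SelectKBest(score_func=f_classif, k=9).fit(X, y)
    print(selector.scores_)                           # per-feature F-values
    X_selected = selector.transform(X)                # keeps the top 9 features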

SHAP Values:

The SHAP (SHapley Additive exPlanations) value is a method to explain the output of an XGBoost model f for a given input vector x = (x_1, x_2, \ldots, x_p). The SHAP value \phi_j(x) for feature j and instance x is calculated as:

\phi_j(x) = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|! \, (|N| - |S| - 1)!}{|N|!} \left[ f_x(S \cup \{j\}) - f_x(S) \right]

where:

  • N = \{1, 2, \ldots, p\} is the set of all feature indices.

  • S is a subset of feature indices from N, representing a coalition of features.

  • f_x(S) is the prediction of the model f for instance x using only the features indexed by S.

  • |S| is the cardinality (number of elements) of the set S.

The SHAP value \phi_j(x) represents the weighted average of the marginal contributions of feature j to the model's prediction, with the weights derived from the Shapley value formulation in cooperative game theory.

Specifically, the term f_x(S \cup \{j\}) - f_x(S) denotes the marginal contribution of feature j to the prediction when it is added to the coalition of features S. The weight \frac{|S|! \, (|N| - |S| - 1)!}{|N|!} is the Shapley value weight, ensuring a fair distribution of the total prediction among the features.

To compute the SHAP values, the XGBoost model needs to be evaluated on all possible subsets of features, which can be computationally intensive for high-dimensional datasets. However, efficient approximation algorithms are available in the SHAP library that estimate the SHAP values with reasonable accuracy.

After computing the SHAP values, they can be used for feature selection by ranking the features based on their average absolute SHAP values or by applying a threshold to identify the most important features.
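A minimal sketch of this workflow with the SHAP library's TreeExplainer is given below; the built-in breast cancer data is a stand-in for any of the tabular datasets in this study.

    # Sketch: ranking features by mean absolute SHAP value for an XGBoost model.
    import numpy as np
    import shap
    import xgboost as xgb
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)
    model = xgb.XGBClassifier(random_state=42).fit(X, y)

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)            # (n_samples, n_features)
    ranking = np.argsort(np.abs(shap_values).mean(axis=0))[::-1]
    print(ranking[:5])                                # indices of the top-5 features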

Appendix F Metrics

For pedagogical purposes, we define the metrics used in the study.

F.1 Binary Classification Metrics

Accuracy

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

Where TP represents True Positives, TN True Negatives, FP False Positives, and FN False Negatives.

F1 Score

\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Where Precision is calculated as \frac{TP}{TP + FP} and Recall as \frac{TP}{TP + FN}.

Area Under the Receiver Operating Characteristic Curve (AUROC)

\text{AUROC} = \int_{0}^{1} TPR(FPR) \, d(FPR)

Where TPR is the True Positive Rate, calculated as \frac{TP}{TP + FN}, and FPR is the False Positive Rate, calculated as \frac{FP}{FP + TN}.

Matthews Correlation Coefficient (MCC)

\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
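All four binary metrics are available in scikit-learn; a toy sketch follows.

    # Sketch: the binary classification metrics above via scikit-learn.
    from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                                 matthews_corrcoef)

    y_true = [0, 0, 1, 1, 1, 0]
    y_pred = [0, 1, 1, 1, 0, 0]
    y_score = [0.2, 0.6, 0.9, 0.7, 0.4, 0.1]   # predicted P(y = 1)

    print(accuracy_score(y_true, y_pred))
    print(f1_score(y_true, y_pred))
    print(roc_auc_score(y_true, y_score))      # AUROC uses scores, not labels
    print(matthews_corrcoef(y_true, y_pred))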

F.2 Multiclass Classification Metrics

Accuracy

\text{Accuracy} = \frac{\sum_{i} TP_i}{\sum_{i} (TP_i + FP_i + FN_i)}

Where TP_i represents True Positives for class i, FP_i False Positives for class i, and FN_i False Negatives for class i.

F1 Score (Macro-Averaged)

\text{F1}_{\text{macro}} = \frac{1}{C} \sum_{i=1}^{C} \text{F1}_i

Where C is the number of classes, \text{F1}_i = 2 \cdot \frac{\text{Precision}_i \cdot \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}, \text{Precision}_i = \frac{TP_i}{TP_i + FP_i}, and \text{Recall}_i = \frac{TP_i}{TP_i + FN_i}.

Matthews Correlation Coefficient (MCC)

\text{MCC} = \frac{\sum_{i} \sum_{j} (TP_{i,j} \cdot TN_{i,j} - FP_{i,j} \cdot FN_{i,j})}{\sqrt{\prod_{i} (TP_i + FP_i)(TP_i + FN_i)(TN_i + FP_i)(TN_i + FN_i)}}

Where TP_{i,j}, TN_{i,j}, FP_{i,j}, and FN_{i,j} represent the True Positives, True Negatives, False Positives, and False Negatives for the pair of classes i and j.
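The macro-averaged F1 and the multiclass MCC are likewise available in scikit-learn; a toy three-class sketch follows.

    # Sketch: multiclass metrics via scikit-learn.
    from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

    y_true = [0, 1, 2, 2, 1, 0, 2]
    y_pred = [0, 2, 2, 2, 1, 0, 1]

    print(accuracy_score(y_true, y_pred))
    print(f1_score(y_true, y_pred, average="macro"))  # macro-averaged F1
    print(matthews_corrcoef(y_true, y_pred))          # multiclass MCC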
