Text Serialization and Their Relationship with the Conventional Paradigms of Tabular Machine Learning (2024)

Kyoka Ono  Simon A. Lee

Abstract

Recent research has explored how Language Models (LMs) can be used for feature representation and prediction in tabular machine learning tasks. This involves employing text serialization and supervised fine-tuning (SFT) techniques. Despite the simplicity of these techniques, significant gaps remain in our understanding of the applicability and reliability of LMs in this context. Our study assesses how emerging LM technologies compare with traditional paradigms in tabular machine learning and evaluates the feasibility of adopting similar approaches with these advanced technologies. At the data level, we investigate various methods of representing and curating serialized tabular data, exploring their impact on prediction performance. At the classification level, we examine whether text serialization combined with LMs enhances performance on tabular datasets with challenging characteristics (e.g., class imbalance, distribution shift, biases, and high dimensionality), and assess whether this method represents a state-of-the-art (SOTA) approach for addressing tabular machine learning challenges. Our findings reveal that current pre-trained models should not replace conventional approaches.


1 Introduction

In the field of natural language processing (NLP), a paradigm shift has occurred, driven by the emergence of Language Model (LM) technologies rooted in the transformer architecture (Vaswani et al., 2017). These advancements have led to immense progress across various domains of machine learning (ML) and artificial intelligence (AI). Leveraging sophisticated techniques such as transfer learning (Weiss et al., 2016) and attention mechanisms (Bahdanau et al., 2014), LMs have demonstrated exceptional capabilities in tasks encompassing language understanding (Devlin et al., 2018), translation (Lewis et al., 2019), and text generation (Radford et al., 2018), thereby significantly influencing applications within the field of NLP. However, researchers from various fields have discovered that these LMs are not limited to conventional tasks. Consequently, there has been a surge of research into other areas and domains, such as question answering (Radford et al., 2019; Su et al., 2019) and mathematical reasoning (Trinh et al., 2024; Wang et al., 2023; Imani et al., 2023), among others.

Therefore, in this paper, we focus on the ability of LMs to solve tabular machine learning tasks as introduced by (Hegselmann et al., 2023; Sahakyan et al., 2021; Dinh et al., 2022; Fang et al., 2024). These studies utilize text serialization—converting tabular data into natural language representations—combined with supervised fine-tuning (SFT) to evaluate LMs’ capability on supervised machine learning tasks. Yet, current papers do not explore whether this process or these LMs could represent a state-of-the-art (SOTA) approach in machine learning. This oversight is especially significant in light of previous assertions that gradient boosting methods outperform deep learning strategies (Grinsztajn et al., 2022).

These previous works also did not determine whether various data curation measures are required to obtain accurate results, or how to handle the data preparation practices commonly used in tabular machine learning (e.g., missing data, feature scaling). As a result, there are open questions in the current literature about text serialization and whether it aligns with conventional machine learning paradigms.

In this work, we explore the unresolved questions related to text serialization. We believe this research is crucial for contrasting the differences between traditional ML methods and emerging methodologies like “text serialization” developed for LM technologies. Thus, we rigorously analyze numerous publicly available tabular datasets and detail the various experiments conducted to gain insights into the current questions in this area of research. We aim to determine whether data curation is necessary and assess whether these pre-trained LMs should be used over traditional tabular solvers like gradient boosting under varying dataset characteristics. The contributions of this paper are as follows:

  • We investigate whether open-source LMs, in conjunction with text serialization, can achieve state-of-the-art (SOTA) performance compared to current ML methods in supervised learning tasks. We aim to determine whether pre-trained models should be preferred over previously established gradient-boosted methods.

  • We investigate how various data curation strategies for text serialization, such as addressing missing values, feature importance, and feature scaling, affect prediction performance. We also consider whether these common protocols should be followed for language modeling.

  • We investigate the adaptability and generalization capabilities of LMs across different characteristics of tabular datasets that are commonly encountered in real-world data (e.g. high dimensionality, imbalance).

  • We evaluate the robustness of LM-based models against common distribution shifts and dataset biases, examining how their pretrained parameters respond to these characteristics.

2 Related Works

2.1 Text Serialization

Text serialization, introduced by (Hegselmann et al., 2023; Dinh et al., 2022; GIDROL, ; Jaitly et al., 2023; Lee et al., 2024a), created an interface for easily integrating tabular data with LMs by converting tabular data fields into natural language representations. Since its emergence, numerous papers across various applications, including healthcare, have adopted a similar approach (Chen et al., 2024a; Kim et al., 2024; Hegselmann et al., 2024; Belyaeva et al., 2023). Lee et al. found that text serialization proved particularly effective for handling categorical tabular data with a large number of classes, observing that a natural language representation outperformed engineered features like one-hot encoding (Lee et al., 2024b). Text serialization has also found application in various reasoning tasks, such as feature extraction, enabling systems to extract information from tables or databases to answer queries, as seen in Question and Answer (Q&A) scenarios (Min et al., 2024; Sui et al., 2024; Li et al., 2024).

Following this conversion from tabular to text, the resulting data can be directly input into foundation models (e.g., BERT (Devlin et al., 2018), GPT (Brown et al., 2020)) to obtain rich feature representations in the form of high-fidelity vectors. Recent research has focused extensively on representing numerical data (Gorishniy et al., 2022; Golkar et al., 2023), where these foundation models have demonstrated competitive and often superior performance compared to current models like XGBoost (Chen & Guestrin, 2016) and LGBM (Ke et al., 2017), providing recent evidence against previous claims that boosted methods are the SOTA (Grinsztajn et al., 2022).

2.2 Tabular Deep Learning

Deep learning has emerged as an exceptional computational framework across numerous disciplines due to its ability to learn complex patterns in large datasets (Zhang et al., 2018; Feng et al., 2019), generalize effectively (Sanh et al., 2021), apply transfer learning techniques (Torrey & Shavlik, 2010; Zhuang et al., 2020; Pan & Yang, 2009; Niu et al., 2020; Levin et al., 2022), and scale with powerful hardware (Mayer & Jacobsen, 2020; Chilimbi et al., 2014; Rouhani et al., 2018). Tabular deep learning has been investigated over the years, yet there remains no consensus on whether it represents the optimal modeling approach for this type of data (Shwartz-Ziv & Armon, 2022; Borisov et al., 2022; Gorishniy et al., 2021). Despite this lack of consensus, many groups continue to explore this field extensively. Examples include TabNet (Arik & Pfister, 2021), TabPFN (Hollmann et al., 2022), SAINT (Somepalli et al., 2021), TabTransformer (Huang et al., 2020), NODE (Popov et al., 2019), and TaBERT (Yin et al., 2020). Kadra et al. demonstrated that even simple neural nets can produce high-performing models compared to baselines (Kadra et al., 2021).

[Figure 1]

More recently, there has been a resurgence of interest in tabular deep learning, driven by advancements in Language Model (LM) technology. Notably, models like TabLLM (Hegselmann et al., 2023), LIFT (Dinh et al., 2022), MEME (Lee et al., 2024b, c), and others (Zhang et al., 2023) have showcased robust performance in both few-shot and fully trained scenarios. However, when evaluating these LMs with zero or few shots, it is challenging to determine whether they are learning the task (Webson & Pavlick, 2021) or merely hallucinating on simpler classification tasks, which complicates model evaluation (Ji et al., 2023; Lee & Lindsey, 2024). Nevertheless, fine-tuning these language models enables them to be adapted to specific tasks using a few-shot (minimal data) approach (Harari & Katz, 2022; Liu et al., 2022; Perez et al., 2021; Zhao et al., 2021).

Despite recent advances in language models and tabular machine learning, numerous unanswered questions remain regarding the use of language models in this field. This study therefore aims to comprehensively address some of these knowledge gaps concerning the systematic approach to the machine learning pipeline and how these new approaches align with conventional paradigms. Additionally, we highlight other common scenarios where pre-trained language models can be beneficial and ask whether these general models should be adopted over previous state-of-the-art models, which are primarily based on gradient boosting. We hypothesize that language models do not adhere to conventional paradigms and do not require data curation techniques, but we believe that these pre-trained models can be effective tabular solvers.

3 Methodology

3.1 Text Serialization

Problem Formulation: Text serialization is the process of transforming structured tabular data $X$ with dimensions $n \times m$ into textual representations, where $n$ is the number of samples and $m$ is the number of features. In their study on TabLLM, Hegselmann et al. identified that text templates and list readouts provided the best results among various serialization strategies (Hegselmann et al., 2023); we therefore adopt a text template approach in our analysis. Mathematically, the transformation can be represented as follows. Let $X = \{x_{ij}\}_{n \times m}$ be the input dataset, where $x_{ij}$ is the value of the $i$-th sample in the $j$-th feature, and let $Y = \{y_i\}_{n}$ be the corresponding set of labels for each sample in $X$. The goal of text serialization is to define a mapping $\Phi: X \rightarrow T$, where $T = \{t_i\}_{n}$ represents the serialized text derived from the data in $X$. The mapping $\Phi$ uses template filling to convert each sample $x_i$ into its corresponding serialized text $t_i$. We then use this textual representation alongside the labels for supervised fine-tuning on classification tasks.
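To make the mapping concrete, the following is a minimal sketch of $\Phi$ as template filling in Python. The template wording and field names here are illustrative rather than the paper's exact templates (the actual templates appear in Appendix Section B).

```python
from typing import Dict, List

def serialize_row(row: Dict[str, object], template: str) -> str:
    """Apply the mapping Phi to one sample x_i, producing its text t_i."""
    return template.format(**row)

def serialize_table(rows: List[Dict[str, object]], template: str) -> List[str]:
    """Serialize every sample in X, yielding the corpus T = {t_i}."""
    return [serialize_row(r, template) for r in rows]

# Illustrative Iris-style template; field names are assumptions.
template = (
    "The Iris has a sepal length of {sepal_length} centimeters. "
    "Sepal width is {sepal_width} centimeters. "
    "Petal length is {petal_length} centimeters. "
    "Petal width is {petal_width} centimeters."
)
rows = [{"sepal_length": 5.1, "sepal_width": 3.5,
         "petal_length": 1.4, "petal_width": 0.2}]
print(serialize_table(rows, template)[0])
```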

3.2 Language Model Selection

In our study, we must select a language model backbone. There are numerous backbones to choose from, but we filter these by selecting language models that were pretrained with text classification objectives. To select the best Language Model (LM) for our benchmark, we evaluate multiple open-source LMs sourced from the Hugging Face sequence classification library (Wolf et al., 2019). We additionally benchmark several models from the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2022), a comprehensive framework designed to evaluate the performance of text embedding models across a wide range of tasks, and select models based on their rank in text classification. This is an effort to find the LM that provides the best representation for our serialized textual data. In the table in Appendix Section C, we highlight the LMs we evaluate with a short description of each.
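As a sketch of how a candidate backbone is loaded and queried through the Hugging Face sequence classification interface, consider the snippet below. The checkpoint name is one example candidate, the sentence is an illustrative serialization, and the classification head is untrained until the fine-tuning described in Section 4.4.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# One example candidate backbone; the full list appears in Appendix Section C.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)  # e.g., a binary task such as Titanic survival

# Score one serialized sample (sentence wording is illustrative).
text = "The passenger is a 29 year old female traveling in first class."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits  # raw scores from the not-yet-tuned head
print(logits)
```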

3.3 Current Understanding and Limitations

What do we know about Text Serialization?

From the literature, several concrete findings about text serialization have been established. Text serialization has enabled the integration of tabular data with language models (LMs), leading to competitive performance on datasets with minimal samples (few-shot) (Hegselmann et al., 2023; Yang et al., 2024) or no samples at all (zero-shot) (Wei et al., 2021; Kojima et al., 2022; Zhong et al., 2021). This success comes from converting data into a natural language format, which allows the hundreds of millions of pre-trained parameters within an LM to be applied through transfer learning to carry out inference. While recent works have progressed toward the ability to read structured data directly (Song et al., 2023; Chen et al., 2024b; Yao et al., 2023), text serialization appears to remain the best method for integrating tabular data with LMs. Another use case of text serialization arises when tabular data has categorical fields with a high number of classes or heterogeneous data (numerical, categorical, free text) within the tabular fields (Lee et al., 2024b). This methodology seamlessly preserves all the data in its natural form (no feature engineering necessary), represented entirely as text. Groups including (Belyaeva et al., 2023; Chen et al., 2024a) also demonstrated that text serialization was particularly effective when integrated with paired multimodal datasets, enabling contrastive methods to learn shared latent representations (Radford et al., 2021).

What needs to be addressed

While considerable progress has been made in advancing tabular data with LM technologies, many intermediate steps at both the data and classification levels remain unexamined. This paper aims to address some of the key gaps in the current literature, providing a more comprehensive understanding of the existing challenges and solutions.

Data Questions:

Many questions remain regarding whether text serialization or LMs adhere to approaches similar to those of traditional machine learning paradigms. This is particularly relevant in the data curation process: handling raw data that contains missing values, identifying important and unimportant features, and dealing with differently distributed numerical data. Data curation is often a crucial component of traditional machine learning pipelines, but no study has yet examined whether similar steps are required with LM technologies for supervised tasks. A visualization of this exploration can be seen in Figure 1.

Classification Questions:

In addition, there have been no studies regarding whether pre-trained LMs should be used for all tabular supervised classification tasks. Therefore, we explore several datasets with commonly encountered characteristics and benchmark them against various tabular SOTA models and traditional machine learning methods. We aim to determine whether LMs support or contradict previous claims that gradient boosting performs better than deep learning-based models on tabular tasks (Grinsztajn et al., 2022).

4 Experimental Setup

4.1 Data

In our study, we utilize eight datasets, which we divide into two groups.

Table 1: Dataset characteristics.

| Dataset   | Sample Size (n) | # of Features (m) | Binary? |
|-----------|-----------------|-------------------|---------|
| Iris      | 150             | 4                 | ✗       |
| Diabetes  | 784             | 8                 | ✓       |
| Titanic ♥ | 891             | 11                | ✓       |
| Wine      | 178             | 13                | ✗       |
| HELOC     | 10,459          | 23                | ✓       |
| Fraud     | 284,807         | 30                | ✓       |
| Crime     | 878,049         | 8                 | ✗       |
| Cancer    | 801             | 20,533            | ✗       |

Baseline Datasets:

The first four datasets are commonly used baselines in tabular machine learning: the Iris, Wine, Diabetes, and Titanic datasets, which are either binary or multiclass (3-class) classification problems sourced from the UCI data repository or previous literature (Asuncion & Newman, 2007; Smith et al., 1988). We use these baseline datasets in our data-level experiments to identify which preprocessing steps affect relative performance and should be adopted for our SOTA experiments.
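As a sketch, the baseline datasets can be loaded from standard public entry points; the CSV paths for Titanic and Diabetes below are placeholder assumptions standing in for the Kaggle/UCI copies.

```python
import pandas as pd
from sklearn.datasets import load_iris, load_wine

iris = load_iris(as_frame=True).frame   # 150 samples, 4 features, 3 classes
wine = load_wine(as_frame=True).frame   # 178 samples, 13 features, 3 classes

# Assumed local copies of the public CSVs (paths are placeholders).
titanic = pd.read_csv("titanic.csv")          # binary survival labels
diabetes = pd.read_csv("pima_diabetes.csv")   # binary onset labels (Smith et al., 1988)
```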

Experimental Datasets:

The second group can be labeled as datasets with interesting and commonly encountered machine learning characteristics. We utilize these datasets only in our SOTA evaluation, using the preprocessing steps identified in our previous experiments. These datasets include the Identifying Targets for Cancer Using Gene Expression Profiles dataset, which exhibits high dimensionality (Fiorini, 2016); the HELOC dataset (Brown et al., 2018), which contains well-documented distribution shift identified by (Gardner et al., 2023); the San Francisco Crime dataset, which contains inherent biases towards certain neighborhoods (Asuncion & Newman, 2007); and the Credit Card Fraud dataset, which exhibits class imbalance (Dal Pozzolo et al., 2015) (0.172% of the data is fraud). These datasets contain a mixture of binary and multi-class classification tasks. All characteristics of the datasets, including sample size and feature count, can be found in Table 1. Additional details about what the raw data looks like and how we serialized it in different ways can be found in Appendix Section B.

4.2 Experiments

At the data level, we have identified gaps in the literature related to data curation for text serialization and whether it follows approaches similar to traditional machine learning paradigms. To explore the effects of various preprocessing measures on serialized tabular data, we use a baseline model where no data curation is performed. We then explore how applying different preprocessing techniques affects performance relative to this baseline.

Additionally, at the classification level, we are interested in testing the robustness of LMs when faced with commonly encountered real-life dataset characteristics. Using the insights from our data curation experiments, we evaluate the LMs against existing methods and commonly used ML methods on datasets that exhibit class imbalance and distribution shift, among other characteristics. By introducing these challenges into our benchmark datasets, we aim to evaluate the relative performance of LMs in tackling fundamentally difficult challenges in tabular machine learning. We describe our experiments in greater detail in the following subsections.

Data Experiments

Feature Selection

Feature selection is the process of identifying and selecting a subset of relevant features from the original set to improve model performance and efficiency (Guyon & Elisseeff, 2003). In our first experiment, we compare a baseline model, where no feature selection is applied, to a model where feature selection is utilized. We employ two feature selection methods: one using SHapley Additive exPlanations (SHAP) values extracted from an XGBoost model and another using the ANOVA F-test (St et al., 1989). Further details on how features are derived from these methods can be found in Appendix Section E. We then assess whether feature selection yields better, worse, or nuanced results. We further include serialized text in Appendix Section B to give readers a view of what these sentences look like with and without feature selection.
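A minimal sketch of the two selection routes follows, assuming a pandas DataFrame `X` and label vector `y`; the exact settings used in the paper are in Appendix Section E.

```python
import numpy as np
import shap
import xgboost
from sklearn.feature_selection import SelectKBest, f_classif

def shap_top_features(X, y, k=5):
    """Rank features by mean |SHAP| from a fitted XGBoost model
    (binary task assumed for this sketch)."""
    model = xgboost.XGBClassifier().fit(X, y)
    shap_values = shap.TreeExplainer(model).shap_values(X)
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1]
    return [X.columns[i] for i in order[:k]]

def anova_top_features(X, y, k=5):
    """Keep the k features with the highest ANOVA F-statistic."""
    selector = SelectKBest(f_classif, k=k).fit(X, y)
    return list(X.columns[selector.get_support()])
```

Only the selected columns are then serialized, yielding the shorter sentences compared in Table 2.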

Feature Scaling & Outlier Handling

Feature scaling involves converting features within a dataset to ensure they are on a similar scale, preventing certain features from dominating others in the analysis. We standardize features (subtracting the mean μ and dividing by the standard deviation σ) when they are on different scales and the machine learning algorithm is scale-sensitive. We normalize features (rescaling to the range [0, 1]) to bring all features to a common range, particularly in the presence of outliers. Additionally, we apply a log transformation when the data is skewed or contains outliers, as it can mitigate the impact of extreme values and make the distribution more normal. These measures are applied to the Titanic dataset (Eaton & Haas, 1995) based on its characteristics, and we report whether such steps are necessary.
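A sketch of the three corrections on the skewed Titanic columns is shown below; the column names follow the common schema for this dataset and are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"age": [22.0, 38.0, 26.0, 80.0],
                   "fare": [7.25, 71.28, 7.92, 512.33]})  # right-skewed fare

standardized = StandardScaler().fit_transform(df)  # (x - mu) / sigma
normalized = MinMaxScaler().fit_transform(df)      # rescale to [0, 1]
log_transformed = np.log1p(df)                     # compress extreme values

# Each scaled variant is then written into the text template in place of
# the raw value before fine-tuning.
```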

Missing Data Handling & Imputation

Missing data handling and imputation involve techniques for addressing and filling in missing values within a dataset to ensure completeness and maintain the integrity of the analysis. Unlike in traditional tabular machine learning, there is no established method for handling missing values in serialized text. Therefore, we explore the effects of ignoring missing values (equivalent to dropping that single cell) and of adding filler sentences as a form of imputation, similar to those described in (Lee et al., 2024b). We then perform a sensitivity analysis observing how much the logarithm of odds (logits) for each class changes under these imputation strategies.
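The three treatments can be sketched as alternative serialization branches for a missing cell; the filler wording below is illustrative, not the paper's exact sentences.

```python
from typing import Optional

def serialize_age(age: Optional[float], strategy: str) -> str:
    """Serialize one (possibly missing) Titanic age cell."""
    if age is not None:
        return f"The passenger is {age:.0f} years old."
    if strategy == "ignore":
        return ""  # baseline: drop the clause, as if the cell were removed
    if strategy == "impute_1":
        return "No value was recorded for this field."  # unrelated filler
    if strategy == "impute_2":
        return "The passenger's age is unknown."        # objective-related filler
    raise ValueError(f"unknown strategy: {strategy}")

for s in ("ignore", "impute_1", "impute_2"):
    print(repr(serialize_age(None, s)))
```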

Classification Experiments

SOTA Benchmarks on Various Tabular Datasets

In our classification experiments, we are particularly interested in how LMs perform compared to traditional machine learning models and several models from the literature. We test for SOTA on all the baseline datasets as well as our experimental datasets referenced in Section 4.1. These include datasets with high dimensionality, distribution shift, bias, and class imbalance. We do not apply data-level corrections (e.g., SMOTE) and instead assess performance with these characteristics intact.

4.3 Benchmarking Baseline Models

To evaluate the relative performance of text serialization and SFT, we identify models commonly used for tabular machine learning that have excelled at tabular tasks. The models included in the evaluation, sourced from (Dinh et al., 2022; Hegselmann et al., 2023), are Support Vector Machines (SVM) with the radial basis function (RBF) kernel (Cortes & Vapnik, 1995), Light Gradient Boosted Machines (LGBM) (Ke et al., 2017), and XGBoost (Chen & Guestrin, 2016). From the literature, we also use TabNet (Arik & Pfister, 2021) and TabPFN (Hollmann et al., 2022), which were optimized for tabular tasks. The metrics we use to evaluate models are F1, accuracy, Area Under the Receiver Operating Characteristic curve (AUROC), and the Matthews correlation coefficient (MCC) (Chicco & Jurman, 2020). When classification objectives are not binary, we use macro averaging to create a uniform view of performance metrics across all methods.
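A sketch of the metric computation, with macro averaging switched on for non-binary objectives:

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             roc_auc_score)

def evaluate(y_true, y_pred, y_prob, multiclass=False):
    """y_prob holds positive-class probabilities (binary) or an
    (n_samples, n_classes) matrix (multi-class)."""
    if multiclass:
        f1 = f1_score(y_true, y_pred, average="macro")
        auroc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
    else:
        f1 = f1_score(y_true, y_pred)
        auroc = roc_auc_score(y_true, y_prob)
    return {"accuracy": accuracy_score(y_true, y_pred), "f1": f1,
            "auroc": auroc, "mcc": matthews_corrcoef(y_true, y_pred)}
```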

4.4 Training and Model Optimization

To optimize language model performance, we use a standard learning rate of 2e-4 with a learning rate scheduler to tune this parameter dynamically. We also apply a dropout of 0.3 to prevent overfitting during fine-tuning, and use a batch size of 64 on each dataset. We minimize binary cross-entropy loss for binary classification and cross-entropy loss for multi-class classification. We implement our evaluations in PyTorch using LMs sourced from Hugging Face, and run all evaluations on a single Tesla V100 GPU with 16 GB of VRAM.
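The configuration above can be sketched as follows for a binary task. The choices of AdamW and ReduceLROnPlateau are assumptions, since the paper specifies only the learning rate, scheduler use, dropout, batch size, and losses.

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",  # the selected TabLM backbone (Section 5.1)
    num_labels=1,               # single logit for binary classification
    dropout=0.3,                # DistilBERT's config exposes this knob
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
scheduler = ReduceLROnPlateau(optimizer)        # dynamic lr adjustment (assumed)
loss_fn = torch.nn.BCEWithLogitsLoss()          # CrossEntropyLoss for multi-class

def training_step(batch):
    """One SFT step over a batch of 64 tokenized serialized sentences."""
    optimizer.zero_grad()
    logits = model(input_ids=batch["input_ids"],
                   attention_mask=batch["attention_mask"]).logits
    loss = loss_fn(logits.squeeze(-1), batch["labels"].float())
    loss.backward()
    optimizer.step()
    return loss.item()
```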

For the standard machine learning models, we conduct a five-fold cross-validation grid search to find optimal hyperparameters for the benchmark. We list the hyperparameters we searched over in the appendix for reproducibility (Appendix Section D).
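A sketch of the search for one baseline; the XGBoost grid shown is illustrative, with the full grids listed in Appendix Section D.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {                      # illustrative grid, not the Appendix D grid
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 6, 9],
    "learning_rate": [0.01, 0.1, 0.3],
}
search = GridSearchCV(XGBClassifier(), param_grid, cv=5, scoring="f1_macro")
# search.fit(X_train, y_train); the benchmark then uses search.best_estimator_.
```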

5 Results

5.1 Language Model Evaluation

[Figure 2]

We begin our analysis by identifying our “TabLM” through a benchmarking study on a set of language models sourced from the Hugging Face sequence classification library and the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2022). We conducted this analysis on the Titanic baseline dataset using serialized text template inputs. From Figure 2, we find that DistilBERT is the best-performing model, and we select it as our Tabular Language Model, which we will refer to as TabLM. One notable finding from this evaluation is that the MTEB ranking for text classification does not carry over to tabular machine learning tasks, as evidenced by standard models outperforming the General Text Embedding (GTE) model (Li et al., 2023). Additionally, the varying fluctuations across performance metrics illustrate how the different pre-training objectives of these foundation models may optimize different performance metrics. Further details are located in Appendix Section C.1.

5.2 Data Curation Results

Feature Selection

In our feature selection experiment, we compare the performance between a baseline language model (LM) without feature selection and an LM that uses shorter serialized sentences containing only important features. These features are identified through XGBoost feature importance and visualized using SHapley Additive exPlanations (SHAP) values and ANOVA F-tests.

Table 2: Effect of feature selection on prediction performance.

| Dataset   | AUROC (without FS) | F1 (without FS) | AUROC (with FS) | F1 (with FS) | Improved? |
|-----------|--------------------|-----------------|-----------------|--------------|-----------|
| Iris      | 1.000              | 1.000           | 1.000           | 1.000        | –         |
| Wine      | 0.952              | 0.944           | 0.976           | 0.972        | ✓         |
| Diabetes  | 0.654              | 0.621           | 0.659           | 0.659        | ✓         |
| Titanic ♥ | 0.786              | 0.871           | 0.777           | 0.852        | ✗         |

This study reveals that feature selection has a positive effect on both F1 score and AUROC in most evaluation datasets, as seen in Table 2. While the results are somewhat nuanced, we observe that selecting appropriate features for serialization tends to enhance performance in classification tasks, and this effect is likely to be stronger in datasets with higher dimensionality.

Feature Scaling & Outlier Handling

In our experiment on feature scaling and outlier handling, we benchmark models that serialize their numerical data using various feature scaling methods to compare their performance across multiple metrics. This evaluation specifically focuses on the Titanic dataset, which exhibits right-skewed distributions in both the fare and age features. To address these issues, we employ standardization, normalization, and logarithmic transformations on these features, applying corrections that offer different benefits as detailed in Section 4. Each method is analyzed for its effectiveness in mitigating the impact of skewness and improving model performance, providing a comprehensive understanding of how feature scaling can influence key performance indicators.

[Figure 3]

Our analysis reveals nuanced results, where various feature scaling methods yield marginal gains and deficits. Based on these results, we advise that scaling methods should be applied in accordance with the classification objectives. However, it is also likely possible to achieve acceptable results without performing feature scaling.


Table 3: State of the Art Evaluation – Baseline Datasets.

| Dataset   | Method    | Accuracy | F1     | AUROC  | MCC    | Current State of the Art             | TabLM SOTA? |
|-----------|-----------|----------|--------|--------|--------|--------------------------------------|-------------|
| Iris      | SVM (RBF) | 1.0000   | 1.0000 | 1.0000 | 1.1870 | 1.0000 (Acc) (Ojha & Nicosia, 2020)  | ✓           |
|           | LGBM      | 1.0000   | 1.0000 | 1.0000 | 1.1870 |                                      |             |
|           | XGBoost   | 1.0000   | 1.0000 | 1.0000 | 1.1870 |                                      |             |
|           | TabNet    | 1.0000   | 1.0000 | 1.0000 | 1.1870 |                                      |             |
|           | TabPFN    | 1.0000   | 1.0000 | –      | 1.1870 |                                      |             |
|           | TabLM     | 1.0000   | 1.0000 | 1.0000 | 1.1870 |                                      |             |
| Wine      | SVM (RBF) | 0.8333   | 0.8107 | 0.9414 | 1.2004 | 0.9800 (Acc) (Di et al., 2020)       | ✗           |
|           | LGBM      | 1.0000   | 1.0000 | 1.0000 | 1.2089 |                                      |             |
|           | XGBoost   | 0.9722   | 0.9663 | 1.0000 | 1.2133 |                                      |             |
|           | TabNet    | 0.8333   | 0.8497 | 0.9503 | 0.7306 |                                      |             |
|           | TabPFN    | 0.9800   | 0.9785 | –      | 0.9704 |                                      |             |
|           | TabLM     | 0.9722   | 0.9761 | 1.0000 | 1.2147 |                                      |             |
| Diabetes  | SVM (RBF) | 0.7662   | 0.7411 | 0.8044 | 0.4833 | 0.7879 (Acc) (Sarkar, 2022)          | ✗           |
|           | LGBM      | 0.7532   | 0.7334 | 0.8129 | 0.4671 |                                      |             |
|           | XGBoost   | 0.7597   | 0.7301 | 0.8235 | 0.4640 |                                      |             |
|           | TabNet    | 0.7273   | 0.6250 | 0.8525 | 0.4329 |                                      |             |
|           | TabPFN    | 0.7662   | 0.7433 | 0.8211 | 0.4870 |                                      |             |
|           | TabLM     | 0.6423   | 0.6594 | 0.6593 | 0.3962 |                                      |             |
| Titanic ♥ | SVM (RBF) | 0.7765   | 0.7687 | 0.8654 | 0.5376 | 0.7985 (Acc) (Sarkar, 2022)          | ✓           |
|           | LGBM      | 0.7877   | 0.7747 | 0.8995 | 0.5572 |                                      |             |
|           | XGBoost   | 0.7989   | 0.7889 | 0.8958 | 0.5812 |                                      |             |
|           | TabNet    | 0.8212   | 0.7612 | 0.8938 | 0.6192 |                                      |             |
|           | TabPFN    | 0.8101   | 0.7344 | 0.4747 | 0.5923 |                                      |             |
|           | TabLM     | 0.8212   | 0.7777 | 0.8521 | 0.6001 |                                      |             |

Handling Missing Data & Imputation

Lastly, in our experiments on missing data handling, we evaluated a baseline model that ignores missing values by not serializing any text for them (equivalent to dropping that cell in tabular data). We then tested two strategies for imputing filler sentences into serialized data. The first strategy (Model: Impute 1) used a sentence with no relevance to the classification objective, while the second (Model: Impute 2) used a filler sentence related to the classification objective. We analyzed the differences, denoted Δ, in the logarithm of odds (logits) by subtracting the logits of the two imputed models from the baseline logits to assess how the logits for each class were affected by imputation. Logits centered at the origin (0, 0) were essentially unaltered, whereas logits deviating from the origin were heavily altered.
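The Δ statistic itself is a simple per-class difference of log-odds; a sketch follows, under the assumption that both models score the same held-out serialized samples.

```python
import numpy as np

def logit_delta(baseline_logits: np.ndarray, imputed_logits: np.ndarray) -> np.ndarray:
    """Per-sample, per-class shift in log-odds; points near the origin in
    Figure 4 correspond to predictions the imputation left intact."""
    return baseline_logits - imputed_logits

# Toy example: one sample, two classes.
delta = logit_delta(np.array([[2.1, -1.3]]), np.array([[1.9, -1.1]]))
print(delta)  # small shifts -> prediction essentially unchanged
```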

[Figure 4]

Our results, displayed in Figure 4, reveal that imputing filler sentences similar to those in (Lee et al., 2024b) should be done with caution, as it appears to cause significant changes in Δ, potentially altering the final class prediction. The model may learn the distribution of the imputed data, which can greatly affect performance, particularly when a specific feature has a substantial amount of missing data.


Table 4: State of the Art Evaluation – Experimental Datasets. Entries marked * indicate that TabPFN could not be reliably evaluated (see Section 5.3).

| Dataset  | Method    | Accuracy | F1      | AUROC   | MCC     | Current State of the Art            | TabLM SOTA? |
|----------|-----------|----------|---------|---------|---------|-------------------------------------|-------------|
| HELOC †  | SVM (RBF) | 0.7223   | 0.7207  | 0.7903  | 0.4426  | N/A                                 | –           |
|          | LGBM      | 0.7280   | 0.7267  | 0.7958  | 0.4541  |                                     |             |
|          | XGBoost   | 0.7170   | 0.7157  | 0.7746  | 0.4321  |                                     |             |
|          | TabNet    | 0.7275   | 0.7070  | 0.7966  | 0.4532  |                                     |             |
|          | TabPFN    | 0.7500*  | 0.7253* | 0.4519* | 0.5014* |                                     |             |
|          | TabLM     | 0.7157   | 0.7025  | 0.7939  | 0.4331  |                                     |             |
| Fraud ♦  | SVM (RBF) | 0.9983   | 0.4996  | 0.4790  | 0.0000  | 0.9530 (AUROC) (Xu et al., 2023)    | ✗           |
|          | LGBM      | 0.9994   | 0.9075  | 0.9083  | 0.8167  |                                     |             |
|          | XGBoost   | 0.9996   | 0.9293  | 0.9811  | 0.8635  |                                     |             |
|          | TabNet    | 0.9994   | 0.8218  | 0.9640  | 0.8215  |                                     |             |
|          | TabPFN    | *        | *       | *       | *       |                                     |             |
|          | TabLM     | 0.9988   | 0.9211  | 0.9155  | 0.8545  |                                     |             |
| Crime ♣  | SVM (RBF) | 0.2006   | 0.0088  | 0.4849  | 0.2310  | N/A                                 | –           |
|          | LGBM      | 0.2636   | 0.0764  | 0.6291  | 0.2395  |                                     |             |
|          | XGBoost   | 0.2606   | 0.0756  | 0.6467  | 0.2389  |                                     |             |
|          | TabNet    | 0.3087   | 0.0502  | 0.7193  | 0.2097  |                                     |             |
|          | TabPFN    | *        | *       | *       | *       |                                     |             |
|          | TabLM     | 0.3212   | 0.0671  | 0.6789  | 0.2437  |                                     |             |
| Cancer ♠ | SVM (RBF) | 1.0000   | 1.0000  | 1.0000  | 1.1428  | N/A                                 | –           |
|          | LGBM      | 1.0000   | 1.0000  | 1.0000  | 1.1428  |                                     |             |
|          | XGBoost   | 1.0000   | 1.0000  | 1.0000  | 1.1428  |                                     |             |
|          | TabNet    | 0.9814   | 0.9735  | 0.9994  | 0.9749  |                                     |             |
|          | TabPFN    | *        | *       | *       | *       |                                     |             |
|          | TabLM     | 0.9833   | 0.9826  | 0.9864  | 0.9792  |                                     |             |

5.3 SOTA Benchmark

Having identified the preprocessing steps that are generally beneficial to language models and text serialization, we now proceed with a comprehensive benchmark across all baseline and experimental datasets. This benchmark compares our TabLM against traditional ML algorithms and two specific algorithms from recent literature: TabNet (Arik & Pfister, 2021) and TabPFN (Hollmann et al., 2022). Note that TabPFN is not suitable for datasets with training sizes above 1,024 or feature counts above 10: predictions become slower and less reliable as dataset size increases, and its authors advise against using it on datasets with over 10k samples due to potential machine crashes from quadratic memory scaling. Consequently, we do not include TabPFN evaluations on the Crime, Cancer, and Fraud classification datasets. We also include current state-of-the-art models identified from competitions and the open web to show the highest reported metric; a separate column in our benchmarks highlights these methods and their winning performance metric.

6 Discussion

6.1 Language Models benefit from Feature Selection

From our study on data curation, we identified that among the three techniques, feature selection was the only beneficial data curation strategy. Other strategies, such as feature scaling and handling missing data, showed negative or nuanced results, suggesting that their inclusion could lead to adverse outcomes. Therefore, based on our findings, we advise researchers who use language models on tabular tasks to apply these data curation techniques with caution. We therefore believe more work has to be done in identifying appropriate serialization strategies.

6.2 Serialization Sensitivity

Previous studies (Hegselmann et al., 2023) and our experiments with imputation indicate that the logarithm of odds is highly sensitive to minor modifications of the serialized text. Hegselmann et al. found that list readouts and text templates were the most effective serialization strategies. However, our analysis suggests that engineering the input text can significantly enhance or reduce the performance of various language models in classification tasks.

6.3 When Do I Use LMs for Tabular Tasks?

From this evaluation, it is not conclusively evident that traditional ML techniques or neural network models designed for tabular tasks should be replaced by emerging language model (LM) techniques. These language models were not optimized for tabular tasks, and it appears challenging to fine-tune these models without large datasets. This is evident in our baseline experiments where all the datasets had sample sizes less than 1000. This situation is analogous to other deep learning methodologies that require substantial data to tune the large number of parameters and are at risk of overfitting to the training set. Regarding the experimental datasets with larger sample sizes, it also appears that pre-training and transfer learning offer little benefit to these tasks and do not enhance predictive performance.

Therefore, while our TabLM model reached SOTA accuracy levels for specific tasks, other methodologies often yielded more robust results across the board. This finding suggests that these models may not be universally suitable for tabular tasks. However, these models were still competitive, despite not always achieving SOTA performance levels. Extensive research is ongoing to optimize LMs, and more recently LLMs, for performing tasks on structured data. However, we believe that pre-trained language models should not replace conventional models, and we support the notion that traditional models are still better suited for tabular tasks than deep learning methods (Grinsztajn et al., 2022).

7 Conclusion

In this study, we conducted a series of experiments related to text serialization and compared them to traditional machine learning paradigms. We assessed how various preprocessing steps can enhance or diminish model performance. We also performed a benchmarking evaluation against traditional ML models and two tabular deep learning models and found that pre-trained language models are not better than these existing methods. We therefore conclude that pre-trained models are not better than gradient-boosted methods.

Code and Data

All code can be found on GitHub. All data are described in Appendix Section B.

7.1 Impact Statement

This work aims to advance Tabular Machine Learning by comparing modern NLP language models (LMs) with traditional paradigms. While not covering all aspects of text serialization and tabular characteristics, the study reveals a generally analogous behavior across the evaluated models.

References

  • Aeberhard & Forina (1991)Aeberhard, S. and Forina, M.Wine.UCI Machine Learning Repository, 1991.DOI: https://doi.org/10.24432/C5PC7J.
  • Arik & Pfister (2021)Arik, S.Ö. and Pfister, T.Tabnet: Attentive interpretable tabular learning.In Proceedings of the AAAI conference on artificial intelligence, volume35, pp. 6679–6687, 2021.
  • Asuncion & Newman (2007)Asuncion, A. and Newman, D.Uci machine learning repository, 2007.
  • Bahdanau etal. (2014)Bahdanau, D., Cho, K., and Bengio, Y.Neural machine translation by jointly learning to align and translate.arXiv preprint arXiv:1409.0473, 2014.
  • Beltagy etal. (2020)Beltagy, I., Peters, M.E., and Cohan, A.Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150, 2020.
  • Belyaeva etal. (2023)Belyaeva, A., Cosentino, J., Hormozdiari, F., Eswaran, K., Shetty, S., Corrado, G., Carroll, A., McLean, C.Y., and Furlotte, N.A.Multimodal llms for health grounded in individual-specific data.In Workshop on Machine Learning for Multimodal Healthcare Data, pp. 86–102. Springer, 2023.
  • Borisov etal. (2022)Borisov, V., Leemann, T., Seßler, K., Haug, J., Pawelczyk, M., and Kasneci, G.Deep neural networks and tabular data: A survey.IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • Brown etal. (2018)Brown, K., Doran, D., Kramer, R., and Reynolds, B.Heloc applicant risk performance evaluation by topological hierarchical decomposition.arXiv preprint arXiv:1811.10658, 2018.
  • Brown etal. (2020)Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., etal.Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020.
  • Chen etal. (2024a)Chen, E., Kansal, A., Chen, J., Jin, B.T., Reisler, J., Kim, D.E., and Rajpurkar, P.Multimodal clinical benchmark for emergency care (mc-bec): A comprehensive benchmark for evaluating foundation models in emergency medicine.Advances in Neural Information Processing Systems, 36, 2024a.
  • Chen & Guestrin (2016)Chen, T. and Guestrin, C.Xgboost: A scalable tree boosting system.In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp. 785–794, 2016.
  • Chen etal. (2024b)Chen, W., Yuan, C., Yuan, J., Su, Y., Qian, C., Yang, C., Xie, R., Liu, Z., and Sun, M.Beyond natural language: Llms leveraging alternative formats for enhanced reasoning and communication.arXiv preprint arXiv:2402.18439, 2024b.
  • Chicco & Jurman (2020)Chicco, D. and Jurman, G.The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation.BMC genomics, 21:1–13, 2020.
  • Chilimbi etal. (2014)Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K.Project adam: Building an efficient and scalable deep learning training system.In 11th USENIX symposium on operating systems design and implementation (OSDI 14), pp. 571–582, 2014.
  • Clark etal. (2020)Clark, K., Luong, M.-T., Le, Q.V., and Manning, C.D.Electra: Pre-training text encoders as discriminators rather than generators.arXiv preprint arXiv:2003.10555, 2020.
  • Cortes & Vapnik (1995)Cortes, C. and Vapnik, V.Support-vector networks.Machine learning, 20:273–297, 1995.
  • DalPozzolo etal. (2014)DalPozzolo, A., Caelen, O., LeBorgne, Y.-A., Waterschoot, S., and Bontempi, G.Learned lessons in credit card fraud detection from a practitioner perspective.Expert systems with applications, 41(10):4915–4928, 2014.
  • DalPozzolo etal. (2015)DalPozzolo, A., Caelen, O., Johnson, R.A., and Bontempi, G.Calibrating probability with undersampling for unbalanced classification.In 2015 IEEE symposium series on computational intelligence, pp. 159–166. IEEE, 2015.
  • DalPozzolo etal. (2017)DalPozzolo, A., Boracchi, G., Caelen, O., Alippi, C., and Bontempi, G.Credit card fraud detection: a realistic modeling and a novel learning strategy.IEEE transactions on neural networks and learning systems, 29(8):3784–3797, 2017.
  • Devlin etal. (2018)Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.Bert: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805, 2018.
  • Di etal. (2020)Di, X., Yu, P., Bu, R., and Sun, M.Mutual information maximization in graph neural networks.In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. IEEE, 2020.
  • Dinh etal. (2022)Dinh, T., Zeng, Y., Zhang, R., Lin, Z., Gira, M., Rajput, S., Sohn, J.-y., Papailiopoulos, D., and Lee, K.Lift: Language-interfaced fine-tuning for non-language machine learning tasks.Advances in Neural Information Processing Systems, 35:11763–11784, 2022.
  • Eaton & Haas (1995)Eaton, J.P. and Haas, C.Titanic: Triumph and tragedy.WW Norton & Company, 1995.
  • Fang etal. (2024)Fang, X., Xu, W., Tan, F.A., Zhang, J., Hu, Z., Qi, Y., Nickleach, S., Socolinsky, D., Sengamedu, S., and Faloutsos, C.Large language models(llms) on tabular data: Prediction, generation, and understanding – a survey, 2024.
  • Feng etal. (2019)Feng, S., Chen, Q., Gu, G., Tao, T., Zhang, L., Hu, Y., Yin, W., and Zuo, C.Fringe pattern analysis using deep learning.Advanced photonics, 1(2):025001–025001, 2019.
  • Fiorini (2016)Fiorini, S.gene expression cancer RNA-Seq.UCI Machine Learning Repository, 2016.DOI: https://doi.org/10.24432/C5R88H.
  • Fisher (1988)Fisher, R.A.Iris.UCI Machine Learning Repository, 1988.DOI: https://doi.org/10.24432/C56C76.
  • Gardner etal. (2023)Gardner, J., Popovic, Z., and Schmidt, L.Benchmarking distribution shift in tabular data with tableshift.Advances in Neural Information Processing Systems, 2023.
  • (29)GIDROL, J.-B.Text classification with llms.
  • Golkar etal. (2023)Golkar, S., Pettee, M., Eickenberg, M., Bietti, A., Cranmer, M., Krawezik, G., Lanusse, F., McCabe, M., Ohana, R., Parker, L., etal.xval: A continuous number encoding for large language models.arXiv preprint arXiv:2310.02989, 2023.
  • Gorishniy etal. (2021)Gorishniy, Y., Rubachev, I., Khrulkov, V., and Babenko, A.Revisiting deep learning models for tabular data.Advances in Neural Information Processing Systems, 34:18932–18943, 2021.
  • Gorishniy etal. (2022)Gorishniy, Y., Rubachev, I., and Babenko, A.On embeddings for numerical features in tabular deep learning.Advances in Neural Information Processing Systems, 35:24991–25004, 2022.
  • Grinsztajn etal. (2022)Grinsztajn, L., Oyallon, E., and Varoquaux, G.Why do tree-based models still outperform deep learning on typical tabular data?Advances in neural information processing systems, 35:507–520, 2022.
  • Guyon & Elisseeff (2003)Guyon, I. and Elisseeff, A.An introduction to variable and feature selection.Journal of machine learning research, 3(Mar):1157–1182, 2003.
  • Harari & Katz (2022)Harari, A. and Katz, G.Few-shot tabular data enrichment using fine-tuned transformer architectures.In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1577–1591, 2022.
  • He etal. (2020)He, P., Liu, X., Gao, J., and Chen, W.Deberta: Decoding-enhanced bert with disentangled attention.arXiv preprint arXiv:2006.03654, 2020.
  • Hegselmann etal. (2023)Hegselmann, S., Buendia, A., Lang, H., Agrawal, M., Jiang, X., and Sontag, D.Tabllm: Few-shot classification of tabular data with large language models.In International Conference on Artificial Intelligence and Statistics, pp. 5549–5581. PMLR, 2023.
  • Hegselmann etal. (2024)Hegselmann, S., Shen, S.Z., Gierse, F., Agrawal, M., Sontag, D., and Jiang, X.A data-centric approach to generate faithful and high quality patient summaries with large language models.arXiv preprint arXiv:2402.15422, 2024.
  • Hollmann etal. (2022)Hollmann, N., Müller, S., Eggensperger, K., and Hutter, F.Tabpfn: A transformer that solves small tabular classification problems in a second.arXiv preprint arXiv:2207.01848, 2022.
  • Huang etal. (2020)Huang, X., Khetan, A., Cvitkovic, M., and Karnin, Z.Tabtransformer: Tabular data modeling using contextual embeddings.arXiv preprint arXiv:2012.06678, 2020.
  • Imani etal. (2023)Imani, S., Du, L., and Shrivastava, H.Mathprompter: Mathematical reasoning using large language models.arXiv preprint arXiv:2303.05398, 2023.
  • Jaitly etal. (2023)Jaitly, S., Shah, T., Shugani, A., and Grewal, R.S.Towards better serialization of tabular data for few-shot classification with large language models, 2023.
  • Ji etal. (2023)Ji, Z., Yu, T., Xu, Y., Lee, N., Ishii, E., and Fung, P.Towards mitigating llm hallucination via self reflection.In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 1827–1843, 2023.
  • Kadra etal. (2021)Kadra, A., Lindauer, M., Hutter, F., and Grabocka, J.Well-tuned simple nets excel on tabular datasets.Advances in neural information processing systems, 34:23928–23941, 2021.
  • Kan (2015)Kan, W.San francisco crime classification, 2015.URL https://kaggle.com/competitions/sf-crime.
  • Kaplan etal. (2020)Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D.Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020.
  • Ke etal. (2017)Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y.Lightgbm: A highly efficient gradient boosting decision tree.Advances in neural information processing systems, 30, 2017.
  • Kim etal. (2024)Kim, Y., Xu, X., McDuff, D., Breazeal, C., and Park, H.W.Health-llm: Large language models for health prediction via wearable sensor data.arXiv preprint arXiv:2401.06866, 2024.
  • Kojima etal. (2022)Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., and Iwasawa, Y.Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022.
  • Lan etal. (2019)Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R.Albert: A lite bert for self-supervised learning of language representations.arXiv preprint arXiv:1909.11942, 2019.
  • Lee & Lindsey (2024)Lee, S.A. and Lindsey, T.Do large language models understand medical codes?arXiv preprint arXiv:2403.10822, 2024.
  • Lee etal. (2024a)Lee, S.A., Brokowski, T., and Chiang, J.N.Enhancing antibiotic stewardship using a natural language approach for better feature representation.arXiv preprint arXiv:2405.20419, 2024a.
  • Lee etal. (2024b)Lee, S.A., Jain, S., Chen, A., Biswas, A., Fang, J., Rudas, A., and Chiang, J.N.Multimodal clinical pseudo-notes for emergency department prediction tasks using multiple embedding model for ehr (meme).arXiv preprint arXiv:2402.00160, 2024b.
  • Lee etal. (2024c)Lee, S.A., Jain, S., Chen, A., Ono, K., Fang, J., Rudas, A., and Chiang, J.N.Emergency department decision support using clinical pseudo-notes, 2024c.
  • Levin etal. (2022)Levin, R., Cherepanova, V., Schwarzschild, A., Bansal, A., Bruss, C.B., Goldstein, T., Wilson, A.G., and Goldblum, M.Transfer learning with deep tabular models.arXiv preprint arXiv:2206.15306, 2022.
  • Lewis etal. (2019)Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L.Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.arXiv preprint arXiv:1910.13461, 2019.
  • Li etal. (2024)Li, J., Hui, B., Qu, G., Yang, J., Li, B., Li, B., Wang, B., Qin, B., Geng, R., Huo, N., etal.Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls.Advances in Neural Information Processing Systems, 36, 2024.
  • Li etal. (2023)Li, Z., Zhang, X., Zhang, Y., Long, D., Xie, P., and Zhang, M.Towards general text embeddings with multi-stage contrastive learning.arXiv preprint arXiv:2308.03281, 2023.
  • Liu etal. (2022)Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., and Raffel, C.A.Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning.Advances in Neural Information Processing Systems, 35:1950–1965, 2022.
  • Liu etal. (2019)Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V.Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692, 2019.
  • Mayer & Jacobsen (2020)Mayer, R. and Jacobsen, H.-A.Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools.ACM Computing Surveys (CSUR), 53(1):1–37, 2020.
  • Min etal. (2024)Min, D., Hu, N., Jin, R., Lin, N., Chen, J., Chen, Y., Li, Y., Qi, G., Li, Y., Li, N., etal.Exploring the impact of table-to-text methods on augmenting llm-based question answering with domain hybrid data.arXiv preprint arXiv:2402.12869, 2024.
  • Muennighoff etal. (2022)Muennighoff, N., Tazi, N., Magne, L., and Reimers, N.Mteb: Massive text embedding benchmark.arXiv preprint arXiv:2210.07316, 2022.
  • Niu etal. (2020)Niu, S., Liu, Y., Wang, J., and Song, H.A decade survey of transfer learning (2010–2020).IEEE Transactions on Artificial Intelligence, 1(2):151–166, 2020.
  • Ojha & Nicosia (2020)Ojha, V. and Nicosia, G.Multi-objective optimisation of multi-output neural trees.In 2020 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8. IEEE, 2020.
  • Pan & Yang (2009)Pan, S.J. and Yang, Q.A survey on transfer learning.IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009.
  • Perez etal. (2021)Perez, E., Kiela, D., and Cho, K.True few-shot learning with language models.Advances in neural information processing systems, 34:11054–11070, 2021.
  • Popov etal. (2019)Popov, S., Morozov, S., and Babenko, A.Neural oblivious decision ensembles for deep learning on tabular data.arXiv preprint arXiv:1909.06312, 2019.
  • Radford etal. (2018)Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., etal.Improving language understanding by generative pre-training.2018.
  • Radford etal. (2019)Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., etal.Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019.
  • Radford etal. (2021)Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., etal.Learning transferable visual models from natural language supervision.In International conference on machine learning, pp. 8748–8763. PMLR, 2021.
  • Rouhani etal. (2018)Rouhani, B.D., Riazi, M.S., and Koushanfar, F.Deepsecure: Scalable provably-secure deep learning.In Proceedings of the 55th annual design automation conference, pp. 1–6, 2018.
  • Sahakyan etal. (2021)Sahakyan, M., Aung, Z., and Rahwan, T.Explainable artificial intelligence for tabular data: A survey.IEEE access, 9:135392–135422, 2021.
  • Sanh etal. (2019)Sanh, V., Debut, L., Chaumond, J., and Wolf, T.Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter.arXiv preprint arXiv:1910.01108, 2019.
  • Sanh etal. (2021)Sanh, V., Webson, A., Raffel, C., Bach, S.H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T.L., Raja, A., etal.Multitask prompted training enables zero-shot task generalization.arXiv preprint arXiv:2110.08207, 2021.
  • Sarkar (2022)Sarkar, T.Xbnet: An extremely boosted neural network.Intelligent Systems with Applications, 15:200097, 2022.
  • Shwartz-Ziv & Armon (2022)Shwartz-Ziv, R. and Armon, A.Tabular data: Deep learning is not all you need.Information Fusion, 81:84–90, 2022.
  • Smith etal. (1988)Smith, J.W., Everhart, J.E., Dickson, W., Knowler, W.C., and Johannes, R.S.Using the adap learning algorithm to forecast the onset of diabetes mellitus.In Proceedings of the annual symposium on computer application in medical care, pp. 261. American Medical Informatics Association, 1988.
  • Somepalli etal. (2021)Somepalli, G., Goldblum, M., Schwarzschild, A., Bruss, C.B., and Goldstein, T.Saint: Improved neural networks for tabular data via row attention and contrastive pre-training.arXiv preprint arXiv:2106.01342, 2021.
  • Song etal. (2023)Song, Y., Xiong, W., Zhu, D., Li, C., Wang, K., Tian, Y., and Li, S.Restgpt: Connecting large language models with real-world applications via restful apis.arXiv preprint arXiv:2306.06624, 2023.
  • St etal. (1989)St, L., Wold, S., etal.Analysis of variance (anova).Chemometrics and intelligent laboratory systems, 6(4):259–272, 1989.
  • Su etal. (2019)Su, D., Xu, Y., Winata, G.I., Xu, P., Kim, H., Liu, Z., and Fung, P.Generalizing question answering system with pre-trained language model fine-tuning.In Proceedings of the 2nd workshop on machine reading for question answering, pp. 203–211, 2019.
  • Sui etal. (2024)Sui, Y., Zhou, M., Zhou, M., Han, S., and Zhang, D.Table meets llm: Can large language models understand structured table data? a benchmark and empirical study.In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pp. 645–654, 2024.
  • Torrey & Shavlik (2010)Torrey, L. and Shavlik, J.Transfer learning.In Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, pp. 242–264. IGI global, 2010.
  • Trinh etal. (2024)Trinh, T.H., Wu, Y., Le, Q.V., He, H., and Luong, T.Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024.
  • Vaswani etal. (2017)Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., and Polosukhin, I.Attention is all you need.Advances in neural information processing systems, 30, 2017.
  • Wang etal. (2023)Wang, K., Ren, H., Zhou, A., Lu, Z., Luo, S., Shi, W., Zhang, R., Song, L., Zhan, M., and Li, H.Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning.arXiv preprint arXiv:2310.03731, 2023.
  • Webson & Pavlick (2021)Webson, A. and Pavlick, E.Do prompt-based models really understand the meaning of their prompts?arXiv preprint arXiv:2109.01247, 2021.
  • Wei etal. (2021)Wei, J., Bosma, M., Zhao, V.Y., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M., and Le, Q.V.Finetuned language models are zero-shot learners.arXiv preprint arXiv:2109.01652, 2021.
  • Weiss etal. (2016)Weiss, K., Khoshgoftaar, T.M., and Wang, D.A survey of transfer learning.Journal of Big data, 3:1–40, 2016.
  • Wolf etal. (2019)Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., etal.Huggingface’s transformers: State-of-the-art natural language processing.arXiv preprint arXiv:1910.03771, 2019.
  • Xu etal. (2023)Xu, H., Pang, G., Wang, Y., and Wang, Y.Deep isolation forest for anomaly detection.IEEE Transactions on Knowledge and Data Engineering, 2023.
  • Yang etal. (2024)Yang, Y., Mishra, S., Chiang, J.N., and Mirzasoleiman, B.Smalltolarge (s2l): Scalable data selection for fine-tuning large language models by summarizing training trajectories of small models.arXiv preprint arXiv:2403.07384, 2024.
  • Yang etal. (2019)Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., and Le, Q.V.Xlnet: Generalized autoregressive pretraining for language understanding.Advances in neural information processing systems, 32, 2019.
  • Yao etal. (2023)Yao, L., Zhang, Y., Yan, Z., and Tian, J.Sai: Solving ai tasks with systematic artificial intelligence in communication network.arXiv preprint arXiv:2310.09049, 2023.
  • Yin etal. (2020)Yin, P., Neubig, G., Yih, W.-t., and Riedel, S.Tabert: Pretraining for joint understanding of textual and tabular data.arXiv preprint arXiv:2005.08314, 2020.
  • Zhang etal. (2023)Zhang, H., Wen, X., Zheng, S., Xu, W., and Bian, J.Towards foundation models for learning on tabular data.arXiv preprint arXiv:2310.07338, 2023.
  • Zhang etal. (2018)Zhang, P., Liu, S., Chaurasia, A., Ma, D., Mlodzianoski, M.J., Culurciello, E., and Huang, F.Analyzing complex single-molecule emission patterns with deep learning.Nature methods, 15(11):913–916, 2018.
  • Zhao etal. (2021)Zhao, Z., Wallace, E., Feng, S., Klein, D., and Singh, S.Calibrate before use: Improving few-shot performance of language models.In International conference on machine learning, pp. 12697–12706. PMLR, 2021.
  • Zhong etal. (2021)Zhong, R., Lee, K., Zhang, Z., and Klein, D.Adapting language models for zero-shot learning by meta-tuning on dataset and prompt collections.arXiv preprint arXiv:2104.04670, 2021.
  • Zhuang etal. (2020)Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., and He, Q.A comprehensive survey on transfer learning.Proceedings of the IEEE, 109(1):43–76, 2020.

Appendix

In the appendix we cover the following sections:

  • Section A: Supplementary Section

  • Section B: Datasets

  • Section C: Foundation Models Table

  • Section D: Hyperparameters of ML Models

  • Section E: Feature Selection Methods

  • Section F: Metrics

Appendix A Supplementary Section

Limitations

One notable limitation of language models in tabular tasks is that they are computationally demanding and costly in terms of runtime compared to methods such as SVMs and gradient boosting. A graphic illustrating runtime at inference is shown in Figure 5. Another limitation of language model technologies is the accessibility barrier their computational demands create. We acknowledge that not all groups have access to GPU hardware, which represents a significant barrier in this field of work. In this study, we therefore elected to use small LMs over recent large language models (LLMs) because they can run as a local instance without advanced hardware, which also aids the reproducibility of this work.

[Figure 5: Runtime at inference for language models compared with conventional methods.]

Another notable limitation of this work was highlighted in Section 6.2. Specifically, serialized sentences appear to heavily influence the raw prediction probabilities. While the premise of TabLLM (Hegselmann et al., 2023) explored this issue, our research, combined with theirs, still leaves lingering questions about the appropriate strategy for text serialization.

Future Work:

Future work should focus on exploring the scalability and performance of these models as the number of parameters increases (Kaplan et al., 2020). As LLMs grow in popularity and size, they demonstrate enhanced capabilities and can produce SOTA performance, but this growth also introduces challenges related to access to computational resources and model optimization.

Appendix B Datasets

B.1 Baseline Datasets

Iris Dataset

The Iris dataset (Fisher, 1988) is a classic dataset in the field of machine learning and statistics, often used for benchmarking classification algorithms. It consists of 150 samples divided equally among three species of Iris flowers: Iris setosa, Iris versicolor, and Iris virginica. Each sample in the dataset is described by four features: sepal length, sepal width, petal length, and petal width, all measured in centimeters. These features are used to predict the species of the Iris flower, making it a multiclass classification problem. The dataset is well-balanced, with 50 samples from each species, providing a clear example for exploring and demonstrating the capabilities of various classification techniques, from simple linear models to more complex, nonlinear classifiers.

sepal length (cm)   sepal width (cm)   petal length (cm)   petal width (cm)   label
5.1                 3.5                1.4                 0.2                0
4.9                 3.0                1.4                 0.2                0
4.7                 3.2                1.3                 0.2                0


Serialized Text:
The Iris has sepal Length is 5.1 centimeters. Sepal width is 3.5 centimeters. Petal length is 1.4 centimeters. Petal width is 0.2 centimeters.
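In practice, such a serialization is just a string template over the row's feature values. Below is a minimal sketch of this idea; the function name serialize_iris and the dictionary keys are illustrative, not our released code.

    # Minimal sketch: template-based serialization of one Iris row.
    # The template mirrors the boxed example above.
    def serialize_iris(row: dict) -> str:
        return (
            f"The Iris has sepal Length is {row['sepal length (cm)']} centimeters. "
            f"Sepal width is {row['sepal width (cm)']} centimeters. "
            f"Petal length is {row['petal length (cm)']} centimeters. "
            f"Petal width is {row['petal width (cm)']} centimeters."
        )

    row = {"sepal length (cm)": 5.1, "sepal width (cm)": 3.5,
           "petal length (cm)": 1.4, "petal width (cm)": 0.2}
    print(serialize_iris(row))  # reproduces the sentence above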

Wine Dataset

The Wine (Aeberhard & Forina, 1991) dataset is a well-regarded dataset in the machine learning community, commonly used to evaluate multiclass classification algorithms. It comprises 178 instances from three different types of Italian wine: Barolo, Grignolino, and Barbera, derived from the Piedmont region. The dataset is characterized by thirteen attributes, including alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline. These attributes are chemically significant and contribute to differentiating one type of wine from another. The objective is to classify each wine into one of the three categories based on its chemical makeup, making it a typical example of a multiclass classification problem.

alcohol   malic_acid   ash   alcalinity_of_ash   magnesium   total_phenols   flavanoids
14.2      1.7          2.4   15.6                127.0       2.8             3.1
13.2      1.8          2.1   11.2                100.0       2.6             2.8
13.2      2.4          2.7   18.6                101.0       2.8             3.2

nonflavanoid_phenols   proanthocyanins   color_intensity   hue   od280/od315_of_diluted_wines   proline   label
0.3                    2.3               5.6               1.0   3.9                            1065.0    0
0.3                    1.3               4.4               1.1   3.4                            1050.0    0
0.3                    2.8               5.7               1.0   3.2                            1185.0    0


Serialized Text:
My wine has an Alcohol percentage of 14.2%. The Malic Acid is 1.7 grams per liter. Ash is 2.4 grams per liter. Alcalinity of ash is 15.6 pH. Magnesium is 127 milligrams per liter. Total Phenols is 2.8 milligrams per liter. Flavanoids is 3.1 milligrams per liter. Nonflavanoid phenols is 0.3 milligrams per liter. Proanthocyanins is 2.3 milligrams per liter. Color intensity is 5.6. Hue is 1.0. OD280/OD315 of diluted wines is 3.9. Proline is 1065.

Feature                        Score
Alcohol                        99.18
Malic Acid                     33.47
Ash                            11.16
Alkalinity of Ash              28.68
Magnesium                      5.52
Total Phenols                  78.24
Flavanoids                     272.00
Nonflavanoid Phenols           26.65
Proanthocyanins                25.28
Color Intensity                101.34
Hue                            85.70
OD280/OD315 of Diluted Wines   175.80
Proline                        151.48


Feature Selected Serialized Text:
My wine has an Alcohol percentage of 14.2%. The Malic Acid is 1.7 grams per liter. Ash is 2.4 grams per liter. Total Phenols is 2.8 milligrams per liter. Flavanoids is 3.1 milligrams per liter. Color intensity is 5.6. Hue is 1.0. OD280/OD315 of diluted wines is 3.9. Proline is 1065.

Diabetes Dataset

The Diabetes dataset (Smith et al., 1988), often referred to as the Pima Indians Diabetes Database, is a frequently used dataset in the domain of medical informatics for predicting the onset of diabetes based on diagnostic measures. This dataset consists of 768 instances, each representing a female at least 21 years old of Pima Indian heritage. The dataset encompasses several medical predictor variables, including the number of pregnancies, plasma glucose concentration, diastolic blood pressure, triceps skinfold thickness, 2-hour serum insulin, body mass index, diabetes pedigree function, and age. The target variable indicates whether the individual was diagnosed with diabetes (1) or not (0), making it a binary classification problem. This dataset is pivotal in the development and testing of predictive models aimed at diagnosing diabetes early and has been instrumental in numerous studies related to machine learning in healthcare.

Pregnancies   Glucose   BloodPressure   SkinThickness   Insulin   BMI    DiabetesPedigreeFunction   Age   Outcome
6             148       72              35              0         33.6   0.6                        50    1
1             85        66              29              0         26.6   0.4                        31    0
8             183       64              0               0         23.3   0.7                        32    1


Serialized Text:
The Age is 50. The Number of times pregnant is 6. The Diastolic blood pressure is 72. The Triceps skin fold thickness is 32. The Plasma glucose concentration at 2 hours in an oral glucose tolerance test (GTT) is 148. The 2-hour serum insulin is 0. The Body mass index is 33.6. The Diabetes pedigree function is 0.6.

Feature                      Importance Score
Pregnancies                  23.93
Glucose                      163.60
Blood Pressure               2.04
Skin Thickness               4.80
Insulin                      8.92
BMI                          62.25
Diabetes Pedigree Function   16.77
Age                          37.07


Feature Selected Serialized Text:
The Age is 50. The Number of times pregnant is 6. The Plasma glucose concentration at 2 hours in an oral glucose tolerance test (GTT) is 148. The Body mass index is 33.6. The Diabetes pedigree function is 0.6.

Titanic Dataset

The Titanic dataset (Eaton & Haas, 1995) is one of the most iconic datasets used in the realm of data science, especially for beginners practicing classification techniques. It comprises passenger records from the tragic maiden voyage of the RMS Titanic in 1912. This dataset typically includes 891 instances, representing a subset of the total passenger list. Each instance includes various attributes such as passenger class (Pclass), name, sex, age, number of siblings/spouses aboard (SibSp), number of parents/children aboard (Parch), ticket number, fare, cabin number, and port of embarkation. The primary objective with this dataset is to predict a passenger’s survival (1 for survived, 0 for did not survive), making it a binary classification problem. The Titanic dataset not only challenges model builders to predict survival outcomes accurately but also provides an opportunity to explore data preprocessing techniques like handling missing values, feature engineering, and categorical data encoding. It serves as a practical introduction to machine learning tasks and is frequently used in educational settings to demonstrate the steps involved in the data science workflow from preprocessing to model evaluation.

PassengerId   Survived   Pclass   Name                                                Sex      Age    SibSp   Parch
1             0          3        Braund, Mr. Owen Harris                             male     22.0   1       0
2             1          1        Cumings, Mrs. John Bradley (Florence Briggs Tha…    female   38.0   1       0
3             1          3        Heikkinen, Miss. Laina                              female   26.0   0       0

Ticket             Fare   Cabin   Embarked
A/5 21171          7.2    NaN     S
PC 17599           71.3   C85     C
STON/O2. 3101282   7.9    NaN     S


Serialized Text:
Passenger Name is Mr. Owen Harris Braund. Passenger is 22-years-old. Passenger is male. They paid $7.2. They are in 3rd-class ticket. They embarked from Southampton. They are with 1 sibling(s)/spouse(s). They are with 0 parent(s)/children. They are staying in cabin Unknown.


Modified Serialized Text (SOTA):
Passenger Mr. Owen Harris Braund, a 22-year-old male, paid $7.2 for a 3rd-class ticket and embarked from Southampton. They were accompanied by 1 sibling(s)/spouse(s) and 0 parent(s)/children, and they were aboard in cabin Unknown.

[Figure 6]


Feature Selected Serialized Text:
Passenger Mr. Owen Harris Braund, a 22-year-old male, paid $7.2 for a 3rd-class ticket. They were accompanied by 1 sibling(s)/spouse(s) and 0 parent(s)/children.

B.2 Experimental Datasets

Home Equity Line of Credit (HELOC) Dataset

The Home Equity Line of Credit (HELOC) dataset is a rich resource for data scientists and machine learning practitioners focusing on financial decision-making processes. This dataset, sourced from real loan applications, includes data from applicants who applied for a home equity line of credit from a lending institution. It features approximately 10,459 instances, each characterized by a series of attributes that are critical in assessing creditworthiness and risk. These attributes include borrower’s credit score, loan to value ratio, number of derogatory remarks, total credit balance, and more, comprising a total of 23 predictive attributes plus a binary target variable. The target variable indicates whether the applicant was approved (1) or rejected (0) for the loan, setting up a binary classification problem. The HELOC dataset not only tests a model’s ability to predict loan approval based on complex interactions between various financial indicators but also pushes the boundaries of responsible AI by emphasizing the need for fair and unbiased decision-making systems in finance. It serves as an excellent basis for developing and refining models that deal with imbalanced data, process personal financial information, and require careful feature engineering and selection to predict outcomes accurately.

RiskPerformance  ExternalRiskEstimate  MSinceOldestTradeOpen  MSinceMostRecentTradeOpen  AverageMInFile  NumSatisfactoryTrades  NumTrades60Ever2DerogPubRec  NumTrades90Ever2DerogPubRec  PercentTradesNeverDelq  MSinceMostRecentDelq
Bad              55                    144                    4                          84              20                     3                            0                            83                      2
Bad              61                    58                     15                         41              2                      4                            4                            100                     -7
Bad              67                    66                     5                          24              9                      0                            0                            100                     -7

MaxDelq2PublicRecLast12M  MaxDelqEver  NumTotalTrades  NumTradesOpeninLast12M  PercentInstallTrades  MSinceMostRecentInqexcl7days  NumInqLast6M  NumInqLast6Mexcl7days  NetFractionRevolvingBurden  NetFractionInstallBurden
3                         5            23              1                       43                    0                             0             0                      33                          -8
0                         8            7               0                       67                    0                             0             0                      0                           -8
7                         8            9               4                       44                    0                             4             4                      53                          66


Serialized Text:
External Risk Estimate is 55. Months Since Oldest Trade Open is 144. Months Since Most Recent Trade Open is 4. Average Months In File is 84. Number of Satisfactory Trades is 20. Number of Trades 60 Ever 2 Derogatory/Public Records is 3. Number of Trades 90 Ever 2 Derogatory/Public Records is 0. Percent Trades Never Delinquent is 83. Months Since Most Recent Delinquency is 2. Max Delinquency 2 Public Record Last 12 Months is 3. Maximum Delinquency Ever is 5. Number of Total Trades is 23. Number of Trades Open in Last 12 Months is 1. Percent Installment Trades is 43. Months Since Most Recent Inquiry Excluding Last 7 Days is 0. Number of Inquiries Last 6 Months is 0. Number of Inquiries Last 6 Months Excluding Last 7 Days is 0. Net Fraction Revolving Burden is 33. Net Fraction Installment Burden is -8. Number of Revolving Trades with Balance is 8. Number of Installment Trades with Balance is 1. Number of Bank/National Trades with High Utilization is 1. Percent of Trades with Balance is 69.

Feature                                                Importance Score
External Risk Estimate                                 390.94
Months Since Oldest Trade Open                         282.23
Months Since Most Recent Trade Open                    14.51
Average Months In File                                 371.41
Number of Satisfactory Trades                          113.51
Number of Trades 60 Ever 2 Derog/Public Rec            45.44
Number of Trades 90 Ever 2 Derog/Public Rec            20.50
Percent Trades Never Delinquent                        116.84
Months Since Most Recent Delinquency                   33.35
Max Delinquency 2 Public Rec Last 12 Months            98.07
Max Delinquency Ever                                   96.19
Number of Total Trades                                 64.18
Number of Trades Open in Last 12 Months                10.90
Percent Installment Trades                             116.30
Months Since Most Recent Inquiry excl 7 days           103.23
Number of Inquiries Last 6 Months                      65.35
Number of Inquiries Last 6 Months excl 7 days          58.71
Net Fraction Revolving Burden                          811.45
Net Fraction Installment Burden                        67.57
Number of Revolving Trades with Balance                19.75
Number of Installment Trades with Balance              13.88
Number of Bank/National Trades with High Utilization   6.33
Percent of Trades with Balance                         337.51


Feature Selected Serialized Text:
External Risk Estimate is 55. Months Since Oldest Trade Open is 144. Average Months In File is 84. Number of Satisfactory Trades is 20. Percent Trades Never Delinquent is 83. Max Delinquency 2 Public Record Last 12 Months is 3. Maximum Delinquency Ever is 5. Number of Total Trades is 23. Percent Installment Trades is 43. Months Since Most Recent Inquiry Excluding Last 7 Days is 0. Number of Inquiries Last 6 Months is 0. Number of Inquiries Last 6 Months Excluding Last 7 Days is 0. Net Fraction Revolving Burden is 33. Net Fraction Installment Burden is -8. Percent of Trades with Balance is 69.

Credit Card Fraud Dataset

The Credit Card Fraud dataset (Dal Pozzolo et al., 2014, 2015, 2017), available on Kaggle, is a critical dataset in the financial sector for the development and testing of anomaly detection systems. This dataset contains transactions made by credit cards in September 2013 by European cardholders. It consists of 284,807 transactions, where each transaction is represented by 31 features. These features include 28 numerical input variables (V1 to V28), which are the result of a Principal Component Analysis (PCA) transformation to protect sensitive information, the transaction amount (Amount), and the time since the first transaction in the dataset (Time). The target variable is binary, indicating fraud (1) or not fraud (0), making it a binary classification problem. The dataset is highly imbalanced, with fraud transactions making up only 0.172% of all transactions. This dataset challenges researchers to effectively detect fraudulent transactions in a highly imbalanced data setting, which is crucial for preventing financial losses due to fraud, and is extensively used in machine learning research focused on fraud detection.
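Given this extreme imbalance, any train/test split should preserve the 0.172% fraud rate in both partitions. The sketch below uses scikit-learn's stratified splitting; the file name creditcard.csv is the dataset's usual Kaggle name and is assumed here.

    # Sketch: a stratified split that preserves the ~0.172% fraud rate.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("creditcard.csv")            # 284,807 rows, 31 columns
    X, y = df.drop(columns=["Class"]), df["Class"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    print(y_train.mean(), y_test.mean())          # both close to 0.00172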

Time   V1     V2     V3     V4     V5     V6     V7     V8     V9     V10    V11    V12    V13    V14    V15
0.0    -1.4   -0.1   2.5    1.4    -0.3   0.5    0.2    0.1    0.4    0.1    -0.6   -0.6   -1.0   -0.3   1.5
0.0    1.2    0.3    0.2    0.4    0.1    -0.1   -0.1   0.1    -0.3   -0.2   1.6    1.1    0.5    -0.1   0.6
1.0    -1.4   -1.3   1.8    0.4    -0.5   1.8    0.8    0.2    -1.5   0.2    0.6    0.1    0.7    -0.2   2.3

V16    V17    V18    V19    V20    V21    V22    V23    V24    V25    V26    V27    V28    Amount   Class
-0.5   0.2    0.0    0.4    0.3    -0.0   0.3    -0.1   0.1    0.1    -0.2   0.1    -0.0   149.6    0
0.5    -0.1   -0.2   -0.1   -0.1   -0.2   -0.6   0.1    -0.3   0.2    0.1    -0.0   0.0    2.7      0
-2.9   1.1    -0.1   -2.3   0.5    0.2    0.8    0.9    -0.7   -0.3   -0.1   -0.1   -0.1   378.7    0


Serialized Transaction Data:
V1 is -1.4. V2 is -0.1. V3 is 2.5. V4 is 1.4. V5 is -0.3. V6 is 0.5. V7 is 0.2. V8 is 0.1. V9 is 0.4. V10 is 0.1. V11 is -0.6. V12 is -0.6. V13 is -1.0. V14 is -0.3. V15 is 1.5. V16 is -0.5. V17 is 0.2. V18 is 0.0. V19 is 0.4. V20 is 0.3. V21 is -0.0. V22 is 0.3. V23 is -0.1. V24 is 0.1. V25 is 0.1. V26 is -0.2. V27 is 0.1. V28 is -0.0.

Feature   Importance Score
V1        2527.72
V2        1998.44
V3        9026.38
V4        4002.88
V5        2345.90
V6        428.86
V7        8861.27
V8        87.15
V9        2133.98
V10       10886.90
V11       5309.16
V12       15834.84
V13       4.13
V14       21806.04
V15       4.06
V16       8917.15
V17       27131.19
V18       2917.22
V19       270.12
V20       93.85
V21       478.77
V22       1.30
V23       1.10
V24       8.64
V25       3.87
V26       4.44
V27       15.92
V28       37.68
Amount    8.72


Feature Selected Serialized Transaction Data:
V1 is -1.4. V2 is -0.1. V3 is 2.5. V4 is 1.4. V5 is -0.3. V6 is 0.5. V7 is 0.2. V9 is 0.4. V10 is 0.1. V11 is -0.6. V12 is -0.6. V14 is -0.3. V16 is -0.5. V17 is 0.2. V18 is 0.0. V19 is 0.4. V20 is 0.3. V21 is -0.0.

San Francisco Crime Dataset

The San Francisco Crime dataset (Kan, 2015), available on Kaggle, is an extensive dataset widely used in the domain of predictive modeling and public safety analytics. It includes incidents derived from the San Francisco Police Department's crime incident reporting system, spanning 12 years from 2003 to 2015. The dataset contains 878,049 instances, each documented with several attributes such as the date, police department district, the category of the crime, the description of the incident, day of the week, and geographical coordinates (latitude and longitude).

The primary objective with this dataset is to predict the category of crime that occurred, making it a multiclass classification problem. Each record is classified into one of 39 distinct crime categories, which include varying offenses from larceny/theft, non-criminal, assault, to drug/narcotic violations. This dataset challenges data scientists to analyze and predict crime patterns based on temporal and spatial features, which is crucial for law enforcement agencies to allocate resources effectively and improve public safety. The San Francisco Crime dataset not only serves as a critical resource for training machine learning models to understand urban crime dynamics but also provides insights into the effectiveness of different policing strategies over time.

Dates                 Category         Descript                   DayOfWeek   PdDistrict   Resolution       Address                     X          Y
2015-05-13 23:53:00   WARRANTS         WARRANT ARREST             Wednesday   NORTHERN     ARREST, BOOKED   OAK ST / LAGUNA ST          -122.425   37.774
2015-05-13 23:53:00   OTHER OFFENSES   TRAFFIC VIOLATION ARREST   Wednesday   NORTHERN     ARREST, BOOKED   OAK ST / LAGUNA ST          -122.425   37.774
2015-05-13 23:33:00   OTHER OFFENSES   TRAFFIC VIOLATION ARREST   Wednesday   NORTHERN     ARREST, BOOKED   VANNESS AV / GREENWICH ST   -122.424   37.800


Serialized Sentence:
The description of the incident was WARRANT ARREST. The crime happened on Wednesday in the NORTHERN police district. The incident happened at OAK ST / LAGUNA ST, with coordinates (-122.4, 37.8).

Gene Expression Profiles for Cancer Target Identification Dataset

The Gene Expression Profiles dataset (Fiorini, 2016) is a vital resource in the burgeoning field of machine learning for drug discovery, specifically in identifying targets for cancer therapies. This dataset consists of gene expression profiles derived from various cancer patients. It includes data from multiple studies focused on different types of cancer, where each sample is described by potentially thousands of gene expression features, reflecting the activity levels of various genes in the tissues sampled from cancer patients.

The primary objective with this dataset is to distinguish between different cancer types or to predict the response of various cancers to treatments, making it an essential tool for multiclass classification or regression problems in biomedical research. The complexity of the dataset, due to the high dimensionality of the feature space and the biological variability among samples, poses significant challenges in model building, feature selection, and interpretation of results.

gene_0   gene_1   gene_2   gene_3   gene_4   gene_5   gene_6   gene_7   gene_8   gene_9
0.0      2.0      3.3      5.5      10.4     0.0      7.2      0.6      0.0      0.0
0.0      0.6      1.6      7.6      9.6      0.0      6.8      0.0      0.0      0.0
0.0      3.5      4.3      6.9      9.9      0.0      7.0      0.5      0.0      0.0

gene_10   gene_11   gene_12   gene_13   gene_14   gene_15   gene_16   gene_17   ...   gene_20000
0.6       1.3       2.0       0.6       0.0       0.0       0.0       0.0       ...   0.4
0.0       0.6       2.5       1.0       0.0       0.0       0.0       0.0       ...   0.0
0.0       0.5       2.0       1.1       0.0       0.0       0.0       0.0       ...   1.3


Serialized Text:
Gene 0 is 0.0. Gene 1 is 0.6. Gene 2 is 1.6. Gene 3 is 7.6. Gene 4 is 9.6. Gene 5 is 0.0. Gene 6 is 6.8. Gene 7 is 0.0. Gene 8 is 0.0. Gene 9 is 0.0.
Gene 10 is 0.0. Gene 11 is 0.6. Gene 12 is 2.5. Gene 13 is 1.0. Gene 14 is 0.0. Gene 15 is 0.0. Gene 16 is 0.0. Gene 17 is 0.0. Gene 18 is 0.0. Gene 19 is 11.1.
Gene 20 is 3.6. Gene 21 is 0.0. Gene 22 is 10.1. Gene 23 is 0.0. Gene 24 is 0.0. Gene 25 is 0.0. Gene 26 is 9.9. Gene 27 is 8.5. Gene 28 is 1.2. Gene 29 is 4.9.
Gene 30 is 0.0. Gene 31 is 0.0. Gene 32 is 5.8. Gene 33 is 1.3. Gene 34 is 13.3. Gene 35 is 6.7. Gene 36 is 0.6. Gene 37 is 0.0. Gene 38 is 9.5. Gene 39 is 0.8.
Gene 40 is 9.7. Gene 41 is 0.0. Gene 42 is 0.3. Gene 43 is 0.0. Gene 44 is 2.7. Gene 45 is 6.7. Gene 46 is 9.8. Gene 47 is 8.8. Gene 48 is 11.5...

[Figure 7]

B.3 Feature Scaling Experiments

We applied various feature scaling techniques to correct the skewness of the Titanic dataset. Below we display examples of serialized sentences with the applied transforms.

[Figure 8]

[Figure 9]

Standardization

z = \frac{x - \mu}{\sigma}    (1)

These symbols denote the following: z represents the standardized value, x stands for the original value, \mu denotes the mean of the data, and \sigma signifies the standard deviation of the data.


Standardized Selected Serialized Text:
Passenger Mr. Owen Harris Braund, a -0.565-year-old male, paid $-0.502 for a 3rd-class ticket. They were accompanied by 1 sibling(s)/spouse(s) and 0 parent(s)/children.

Normalization

x_{\text{norm}} = \frac{x}{\max(x)}    (2)

In this context, x_{\text{norm}} denotes the normalized value of x, where x stands for the original value and \max(x) represents the maximum value in the dataset.


Normalized Serialized Text:
Passenger Mr. Owen Harris Braund, a 0.271-year-old male, paid $0.014 for a 3rd-class ticket. They were accompanied by 1 sibling(s)/spouse(s) and 0 parent(s)/children.

Log Transformation

y = \log(x)    (3)

In this context, y represents the logarithmically transformed value of x, where x stands for the original value.


Log-Transformed Serialized Text:
Passenger Mr. Owen Harris Braund, a 3.135-year-old male, paid $2.110 for a 3rd-class ticket. They were accompanied by 1 sibling(s)/spouse(s) and 0 parent(s)/children.
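As a rough sketch, all three transforms above can be applied column-wise with pandas and NumPy before serialization; the toy DataFrame and column choice below are illustrative rather than our exact preprocessing pipeline.

    # Sketch: applying transforms (1)-(3) to numeric Titanic columns.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Fare": [7.2, 71.3, 7.9]})

    standardized = (df - df.mean()) / df.std()    # Eq. (1), sample std
    normalized = df / df.max()                    # Eq. (2), per column
    log_scaled = np.log(df)                       # Eq. (3), values must be > 0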

Appendix C Foundation Models Table

BERT (Devlin et al., 2018): Originally pretrained on a corpus consisting of Wikipedia and BookCorpus using masked language modeling (MLM) and next sentence prediction (NSP) tasks to generate bidirectional context representations.

DistilBERT (Sanh et al., 2019): A lighter version of BERT, retaining most of its predecessor's capabilities but with fewer parameters, pretrained using a knowledge distillation process during the MLM task.

RoBERTa (Liu et al., 2019): A variant of BERT optimized through more extensive training on larger data and removal of the NSP task, focusing solely on MLM for better performance.

Electra (Clark et al., 2020): Trained using replaced token detection rather than masked language modeling; Electra discriminates between "real" and "fake" tokens across a corpus, allowing for more efficient learning.

XLNet (Yang et al., 2019): Combines the best of autoregressive and autoencoding techniques, pretrained on a permutation-based language modeling task, which captures bidirectional contexts dynamically.

ALBERT (Lan et al., 2019): A Lite BERT that introduces parameter-reduction techniques to increase training speed and lower memory consumption, focusing on MLM and sentence-order prediction.

DeBERTa (He et al., 2020): Enhances BERT and RoBERTa by incorporating disentangled attention and a new way of encoding positional information, improving on the MLM and NSP tasks.

GPT-2 (Radford et al., 2019): Utilizes a left-to-right autoregressive approach in its pretraining, allowing each token to condition on the previous tokens in a sequence, optimized for a variety of natural language understanding tasks.

Longformer (Beltagy et al., 2020): Designed for longer texts, this model extends the BERT architecture by employing a combination of sliding-window and global attention mechanisms, focusing on efficiency and scalability.

GTE Large (Li et al., 2023): The general text embedding (GTE) model, trained with a multi-stage contrastive learning pre-training objective; it scores highly on the text classification portion of the MTEB benchmark.

GTE Base: Similar to GTE Large but with fewer parameters, focused on achieving comparable performance to larger models while being more computationally efficient.
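Each encoder in this table can be wrapped with a classification head through the Transformers Auto classes. The sketch below assumes the common Hugging Face Hub checkpoint names, which may differ from the exact checkpoints used in our experiments.

    # Sketch: loading any of the tabulated models for sequence classification.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    checkpoint = "bert-base-uncased"  # e.g. "distilbert-base-uncased", "roberta-base"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # A serialized row is classified like any other sentence.
    inputs = tokenizer("Passenger is 22-years-old. Passenger is male.", return_tensors="pt")
    logits = model(**inputs).logits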

C.1 Results of Language Model Evaluation

Model        Loss     Accuracy   Precision   Recall   F1 Score   AUROC    AUPRC    Runtime (s)   Samples/s
BERT         0.4903   0.7821     0.7536      0.7027   0.7273     0.8483   0.8262   5.0933        35.144
DistilBERT   0.4535   0.8045     0.7097      0.8919   0.7904     0.8743   0.8426   2.6072        68.656
RoBERTa      0.5547   0.7989     0.7317      0.8108   0.7692     0.8206   0.7448   4.7434        37.737
Electra      0.4583   0.8268     0.7529      0.8649   0.8050     0.8515   0.7665   5.1101        35.029
XLNet        0.5574   0.7821     0.7536      0.7027   0.7273     0.8529   0.8222   17.336        10.325
ALBERT       0.4802   0.7989     0.7262      0.8243   0.7722     0.8387   0.7637   5.8252        30.729
DeBERTa      0.5057   0.7933     0.7342      0.7838   0.7582     0.8059   0.7006   3.2567        54.964
GPT-2        0.6947   0.6592     0.8824      0.2027   0.3297     0.8408   0.7877   2.0704        86.456
Longformer   0.5092   0.7989     0.7436      0.7838   0.7632     0.8138   0.6742   3.7726        47.447
GTE-large    0.5226   0.7933     0.7761      0.7027   0.7376     0.8704   0.7947   6.4885        27.587
GTE-Base     0.5336   0.7821     0.9070      0.5270   0.6667     0.8725   0.8139   2.1677        82.575
[Figure 10]
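The columns of the table above can be reproduced with a standard metrics callback. The sketch below is a plausible compute_metrics function for the Hugging Face Trainer, assuming binary labels and two-class logits; it is an illustrative reconstruction, not the exact evaluation script used here.

    # Sketch: metrics callback producing the columns of the table above.
    import numpy as np
    from scipy.special import softmax
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score, average_precision_score)

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        probs = softmax(logits, axis=-1)[:, 1]   # P(class = 1)
        preds = np.argmax(logits, axis=-1)
        return {
            "accuracy": accuracy_score(labels, preds),
            "precision": precision_score(labels, preds),
            "recall": recall_score(labels, preds),
            "f1": f1_score(labels, preds),
            "auroc": roc_auc_score(labels, probs),
            "auprc": average_precision_score(labels, probs),
        }

    # Toy check with two examples:
    print(compute_metrics((np.array([[0.1, 2.3], [1.9, -0.2]]), np.array([1, 0]))))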

Appendix D Hyperparameters of Baseline Models

Hyperparameters play a pivotal role in machine learning, governing various aspects of the model training process. In our work, we utilized grid search to systematically explore the optimal settings for different models, adopting the hyperparameter ranges reported by Hegselmann et al. (2023) for the LightGBM and XGBoost models. For XGBoost, we configured parameters such as max_depth ranging from 2 to 12, lambda_l1 and lambda_l2 from 1e-8 to 1.0, and eta from 0.01 to 0.3. For LightGBM, we examined num_leaves from 2 to 4096, lambda_l1 and lambda_l2 extending up to 10.0, and learning_rate from 0.01 to 0.3. The SVM model with an RBF kernel was tested with C values between 0.1 and 100 and gamma values from 0.001 to 1, as well as 'auto' and 'scale'. This comprehensive hyperparameter tuning ensures that the most effective parameter combinations are identified, leading to improved accuracy and robustness; a grid-search sketch follows the parameter listing below.

XGBoost
Model: xgb.XGBClassifier(random_state=42)
Parameters:
  max_depth: [2, 4, 6, 8, 10, 12]
  lambda_l1: [1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
  lambda_l2: [1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
  eta: [0.01, 0.03, 0.1, 0.3]

LightGBM
Model: lgb.LGBMClassifier(random_state=42)
Parameters:
  num_leaves: [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
  lambda_l1: [1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]
  lambda_l2: [1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]
  learning_rate: [0.01, 0.03, 0.1, 0.3]

SVM (RBF)
Model: SVC(probability=True, random_state=42)
Parameters:
  C: [0.1, 1, 10, 100]
  gamma: [0.001, 0.01, 0.1, 1, 'auto', 'scale']
  kernel: ['rbf']
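As referenced above, these grids plug directly into scikit-learn's GridSearchCV. The sketch below wires in the SVM (RBF) grid; the synthetic data stands in for any of the datasets, and the same pattern applies to the XGBoost and LightGBM grids.

    # Sketch: grid search over the SVM (RBF) parameter grid listed above.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    param_grid = {
        "C": [0.1, 1, 10, 100],
        "gamma": [0.001, 0.01, 0.1, 1, "auto", "scale"],
        "kernel": ["rbf"],
    }
    X, y = make_classification(n_samples=200, random_state=42)  # stand-in data
    search = GridSearchCV(SVC(probability=True, random_state=42),
                          param_grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    print(search.best_params_)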

Appendix E Feature Selection Methods

ANOVA F-test

The ANOVA F-test feature selection method works by computing the ANOVA F-value between each feature and the target variable for classification tasks. The ANOVA F-value is a ratio of the between-group variability to the within-group variability, and it measures how well a feature can separate the samples into different classes.

Mathematically, the ANOVA F-value for a feature X and a target variable Y with k classes is calculated as follows:

  1. Calculate the mean of X within each class: \mu_j = \frac{1}{n_j} \sum_{Y_i = j} X_i, where n_j is the number of samples in class j.

  2. Calculate the overall mean of X: \mu = \frac{1}{n} \sum_{i} X_i, where n is the total number of samples.

  3. Calculate the between-group sum of squares (SSB):

     \text{SSB} = \sum_{j} n_j (\mu_j - \mu)^2

  4. Calculate the within-group sum of squares (SSW), summing over all classes j:

     \text{SSW} = \sum_{j} \sum_{Y_i = j} (X_i - \mu_j)^2

  5. Calculate the ANOVA F-value:

     F = \frac{\text{SSB}/(k - 1)}{\text{SSW}/(n - k)}

The higher the F-value, the more discriminative the feature is for separating the classes.
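In practice this statistic need not be computed by hand: scikit-learn's f_classif implements the F-value above, and SelectKBest retains the highest-scoring features. A minimal sketch on the Wine data follows, with k = 9 matching the feature-selected serialization in Appendix B.

    # Sketch: ANOVA F-test feature selection with scikit-learn.
    from sklearn.datasets import load_wine
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_wine(return_X_y=True)                 # 178 x 13, 3 classes
    selector = SelectKBest(score_func=f_classif, k=9).fit(X, y)
    print(selector.scores_)                           # per-feature F-values
    X_selected = selector.transform(X)                # keeps the top 9 features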

SHAP Values:

The SHAP (SHapley Additive exPlanations) value is a method to explain the output of an XGBoost model f for a given input vector x = (x_1, x_2, \ldots, x_p). The SHAP value \phi_j(x) for feature j and instance x is calculated as:

\phi_j(x) = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|! \, (|N| - |S| - 1)!}{|N|!} \left[ f_x(S \cup \{j\}) - f_x(S) \right]

where:

  • N = \{1, 2, \ldots, p\} is the set of all feature indices.

  • S is a subset of feature indices from N, representing a coalition of features.

  • f_x(S) is the prediction of the model f for instance x using only the features indexed by S.

  • |S| is the cardinality (number of elements) of the set S.

The SHAP value \phi_j(x) represents the weighted average of the marginal contributions of feature j to the model's prediction, with the weights derived from the Shapley value formulation in cooperative game theory.

Specifically, the term f_x(S \cup \{j\}) - f_x(S) denotes the marginal contribution of feature j to the prediction when it is added to the coalition of features S. The weight \frac{|S|! \, (|N| - |S| - 1)!}{|N|!} is the Shapley value weight, ensuring a fair distribution of the total prediction among the features.

To compute the SHAP values, the XGBoost model needs to be evaluated on all possible subsets of features, which can be computationally intensive for high-dimensional datasets. However, efficient approximation algorithms are available in the SHAP library that estimate the SHAP values with reasonable accuracy.

After computing the SHAP values, they can be used for feature selection by ranking the features based on their average absolute SHAP values or by applying a threshold to identify the most important features.
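A minimal sketch of this workflow with the SHAP library's TreeExplainer is given below; the built-in breast cancer data is a stand-in for any of the tabular datasets in this study.

    # Sketch: ranking features by mean absolute SHAP value for an XGBoost model.
    import numpy as np
    import shap
    import xgboost as xgb
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)
    model = xgb.XGBClassifier(random_state=42).fit(X, y)

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)            # (n_samples, n_features)
    ranking = np.argsort(np.abs(shap_values).mean(axis=0))[::-1]
    print(ranking[:5])                                # indices of the top-5 features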

Appendix F Metrics

For pedagogical purposes, we define the metrics used in the study.

F.1 Binary Classification Metrics

Accuracy

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

Where TP represents True Positives, TN True Negatives, FP False Positives, and FN False Negatives.

F1 Score

\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Where Precision is calculated as \frac{TP}{TP + FP} and Recall as \frac{TP}{TP + FN}.

Area Under the Receiver Operating Characteristic Curve (AUROC)

\text{AUROC} = \int_{0}^{1} TPR(FPR) \, d(FPR)

Where TPR is the True Positive Rate, calculated as \frac{TP}{TP + FN}, and FPR is the False Positive Rate, calculated as \frac{FP}{FP + TN}.

Matthews Correlation Coefficient (MCC)

\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
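All four binary metrics are available in scikit-learn; a toy sketch follows.

    # Sketch: the binary classification metrics above via scikit-learn.
    from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                                 matthews_corrcoef)

    y_true = [0, 0, 1, 1, 1, 0]
    y_pred = [0, 1, 1, 1, 0, 0]
    y_score = [0.2, 0.6, 0.9, 0.7, 0.4, 0.1]   # predicted P(y = 1)

    print(accuracy_score(y_true, y_pred))
    print(f1_score(y_true, y_pred))
    print(roc_auc_score(y_true, y_score))      # AUROC uses scores, not labels
    print(matthews_corrcoef(y_true, y_pred))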

F.2 Multiclass Classification Metrics

Accuracy

\text{Accuracy} = \frac{\sum_{i} TP_i}{\sum_{i} (TP_i + FP_i + FN_i)}

Where TP_i represents True Positives for class i, FP_i False Positives for class i, and FN_i False Negatives for class i.

F1 Score (Macro-Averaged)

\text{F1}_{\text{macro}} = \frac{1}{C} \sum_{i=1}^{C} \text{F1}_i

Where C is the number of classes, \text{F1}_i = 2 \cdot \frac{\text{Precision}_i \cdot \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}, \text{Precision}_i = \frac{TP_i}{TP_i + FP_i}, and \text{Recall}_i = \frac{TP_i}{TP_i + FN_i}.

Matthews Correlation Coefficient (MCC)

\text{MCC} = \frac{\sum_{i} \sum_{j} (TP_{i,j} \cdot TN_{i,j} - FP_{i,j} \cdot FN_{i,j})}{\sqrt{\prod_{i} (TP_i + FP_i)(TP_i + FN_i)(TN_i + FP_i)(TN_i + FN_i)}}

Where TP_{i,j}, TN_{i,j}, FP_{i,j}, and FN_{i,j} represent the True Positives, True Negatives, False Positives, and False Negatives for the pair of classes i and j.
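The macro-averaged F1 and the multiclass MCC are likewise available in scikit-learn; a toy three-class sketch follows.

    # Sketch: multiclass metrics via scikit-learn.
    from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

    y_true = [0, 1, 2, 2, 1, 0, 2]
    y_pred = [0, 2, 2, 2, 1, 0, 1]

    print(accuracy_score(y_true, y_pred))
    print(f1_score(y_true, y_pred, average="macro"))  # macro-averaged F1
    print(matthews_corrcoef(y_true, y_pred))          # multiclass MCC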
