The essence of decision is choice; and, to choose, it’s first necessary to know. — Theodore Sorensen
Abstract
Centering on text embedding practices in natural language processing and the challenges inherent in the modeling process, this research project explores the machine learning pipeline for building a predictive model that classifies arXiv research papers into predefined primary categories. Effective auto-classification of documents streamlines processes within organizations, helps derive useful knowledge, and informs better decision-making to drive growth for entities with complex data sets. Two modern embedding techniques, doc2vec and MPNet, are indispensable components of our ML pipeline; both generate high-dimensional vector representations (embeddings) after the models have been trained on the full text of labeled arXiv articles. These computed embeddings are then used to train an artificial neural network designed for classification. The out-of-sample average accuracies of the proposed models, trained and evaluated on a distributionally balanced dataset, are 52.041% ± 0.167 and 72.267% ± 0.043 with doc2vec and MPNet deployed respectively. Both outperform the baseline method, FB Abs, on identical datasets. Furthermore, when MPNet is used as an end-to-end model, its performance is on par with that of state-of-the-art embedding methods such as Longformer and SciBERT. This study demonstrates the competitive edge of using MPNet as an embedding layer that captures semantic information about the text in vectors from which a customized ANN model learns for downstream classification tasks. It also bolsters the argument that large language models fine-tuned on in-domain data and adapted to specific tasks often achieve, in practice, superior performance compared to small models trained from scratch.
Future work will focus on training and evaluating on a distributionally imbalanced dataset similar to the distribution of real-world arXiv submissions, further exploring MPNet’s capability as a standalone classification model, investigating models with encoder-decoder transformer-based architectures like T5 or BART, and applying them to text generation and summarization in different domains.
The arXiv serves as a preprint repository for research papers across several fields. This project aims to develop a machine learning algorithm that uses modern embedding techniques, based on neural networks and a more recent invention, transformers, to classify preprints automatically by assigning them to their corresponding primary categories. While arXiv has recently introduced an API for article classification using established methods, we aim to investigate whether more advanced embedding techniques can strengthen auto-classification performance. Throughout the project, we explore multiple potential designs to make such an improvement possible.
Data collection is the foundation of every machine learning project. Open-access platforms like arXiv generally provide ways to download their data under guidelines. Out of respect for its terms of service and to avoid negative consequences, we first visited arXiv’s website to assess the download options for both the metadata and the full text, rather than crawling its export service without permission. We soon learned that bulk access to the datasets is provided on AWS and, through Kaggle, on Google Cloud Storage. Since Kaggle’s datasets are updated regularly and require no fees, we settled on the Kaggle option to gather these large datasets for our research.
gcloud and gsutil are the two most commonly used command-line tools for interacting with services on Google Cloud, especially Google Cloud Storage. For any dataset that has as many articles as arXiv's archive does (pun intended), downloading everything in full is challenging, to say the least. We want the PDFs downloaded within a reasonable timeframe, so download speed is one consideration. Second, since errors are to be expected with real-world data, we want to be cautious during the download so that avoidable failures, such as a mislabeled file or a human error that blocks an article we could easily obtain otherwise, can be caught and reconciled early. With these in mind, our solution is to use gsutil together with multithreading, which not only speeds up the download but also gives us more fine-grained control over storage operations when interacting with web servers.
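A minimal sketch of the multithreaded download idea, not the exact script used in this project. It assumes `gsutil` is installed and on the PATH; the function names, the worker count, and the example URIs are hypothetical. Each object is copied with a separate `gsutil cp` call on a thread pool, and failures are collected instead of aborting the run so they can be reconciled in a later pass.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_with_gsutil(uri: str, dest_dir: str) -> None:
    """Copy one object from Google Cloud Storage with `gsutil cp`."""
    # check=True raises CalledProcessError if gsutil exits with a non-zero code.
    subprocess.run(["gsutil", "cp", uri, dest_dir], check=True, capture_output=True)


def download_all(uris, dest_dir, fetch=fetch_with_gsutil, max_workers=8):
    """Download every URI on a thread pool; return (succeeded, failed) lists."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URI so failures can be reported by name.
        futures = {pool.submit(fetch, uri, dest_dir): uri for uri in uris}
        for future in as_completed(futures):
            uri = futures[future]
            try:
                future.result()
                succeeded.append(uri)
            except Exception as exc:
                # Keep going; record the failure for a later retry or manual check.
                failed.append((uri, repr(exc)))
    return succeeded, failed
```

A call might look like `ok, bad = download_all(list_of_gs_uris, "pdfs/")`, where `list_of_gs_uris` is a hypothetical list of `gs://` paths. Passing a different `fetch` function makes it easy to dry-run the logic without touching the network. Note that gsutil also has its own `-m` flag for parallel copies; the thread pool here is just one way to keep per-file control over errors.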
Although each downloaded PDF is an item in the data collection process, the content of the PDFs, namely the text, should be the input for the embedding models later down the line. Therefore, an indispensable intermediate step needs to be implemented to convert PDFs to text via helper functions along with multi-processing/parallel execution to shorten conversion time. The product of this procedure is the raw full text, before preprocessing, to train embedding models.
While doc2vec and MPNet have differing underlying architectures and thus require different data formats, one data transformation they share is tokenization. In essence, each tokenized word or sub-word is the basic input unit from which embedding models learn and capture the semantic and contextual meaning of words, provided that the models have the capacity, i.e., the context window, to handle the given length of text. Tokenization is a must in natural language processing tasks. In addition, other data cleaning techniques, including removing stop words and extra white space and lowercasing, are then applied to reduce the noise in the data.
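A minimal sketch of the cleaning steps named above (lowercasing, white-space normalization, word-level tokenization, stop-word removal). The stop-word list and the regular expression are illustrative assumptions, not the ones used in this project, and the word-level split is the kind of input doc2vec expects; MPNet relies on its own sub-word tokenizer instead.

```python
import re

# A tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "for", "on"}


def preprocess(text: str) -> list[str]:
    """Lowercase, collapse white space, split into word tokens, drop stop words."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()
    # Keep alphanumeric runs, allowing inner hyphens/apostrophes ("well-known").
    tokens = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text)
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(preprocess("The  Theory of\nQuantum   Fields is well-known"))
# ['theory', 'quantum', 'fields', 'well-known']
```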
Lastly, one important thing to consider when working on a classification project is the distribution of the data. Real-world data rarely has a balanced distribution, and if we train our models on such data, overfitting to the abundant majority classes and underfitting on the few minority classes will hurt the models’ generalizability. Models learn to be biased toward predicting the majority class and to ignore the minority. Monitoring evaluation metrics, especially per-class ones rather than overall accuracy alone, is critical for spotting problems of class imbalance.
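One common way to obtain a balanced dataset is to randomly downsample every class to the size of the smallest one. The sketch below illustrates that idea only; it is an assumption for illustration and not necessarily how the balanced dataset in this project was built. The function name and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def downsample_to_balance(samples, labels, seed=0):
    """Randomly keep the same number of samples per class (the minority count)."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = min(len(items) for items in by_class.values())
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    balanced = []
    for label in sorted(by_class):
        for sample in rng.sample(by_class[label], target):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced


# Toy labels: 5 "cs", 2 "math", 3 "physics" documents (made up for illustration).
toy_labels = ["cs"] * 5 + ["math"] * 2 + ["physics"] * 3
toy_balanced = downsample_to_balance(range(len(toy_labels)), toy_labels)
print(Counter(label for _, label in toy_balanced))
```

Downsampling throws data away, so it trades training-set size for balance; class weighting or oversampling are alternatives when the minority classes are very small.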
Mathematical predictive models learn from numbers only, not words. Since they cannot process text directly, the words in the text need to be turned into numerical representations, and this is where embedding methods like doc2vec, an extension of word2vec (a major breakthrough in NLP in 2013), come into play. In the world of machine learning, advanced statistical techniques have been developed to detect underlying structures, trends, relationships, or patterns in numerical data, if they exist, thus enabling predictions. When dealing with text, what we want from embedding methods is the preservation of the “structure”, namely the semantic meaning, of the text in its numerical form, so that our mathematical models can effectively learn the qualitative aspect of the text and, as a result, have a sufficient level of language understanding to perform various downstream tasks such as, in our case, classifying documents into their appropriate categories. But how do we gauge whether the embedding models do a good job of learning the textual meaning? One way to find out is by examining the closeness of their outputs, the token embeddings. Tokens are, again, fragments of words, characters, words, or any basic units into which raw text gets decomposed. Embeddings, the numerical representations of tokens, are vectors of real numbers that live in a high-dimensional vector space. Consequently, the geometric closeness of points in that space corresponds to the semantic similarity between the words (tokens) represented by those points (vectors/embeddings). Neat! A common way to measure this closeness is cosine similarity: the dot product of the two vectors divided by the product of their lengths, which always falls between -1 and 1. The higher the cosine similarity, the more closely the two words are semantically related. For example, tea and coffee should ideally have a cosine similarity score above 0.5, and the same should hold for words like Eiffel and tower, cats and meow, and so on. You get the gist.
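To make the similarity score concrete, here is a small sketch of cosine similarity, cos(u, v) = (u · v) / (‖u‖ ‖v‖). The three-dimensional vectors below are made up for illustration; real embeddings come from a trained model and have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings, invented for this example only.
tea = [0.9, 0.8, 0.1]
coffee = [0.8, 0.9, 0.2]
tower = [0.1, 0.2, 0.9]

print(round(cosine_similarity(tea, coffee), 2))  # 0.99: pointing in nearly the same direction
print(round(cosine_similarity(tea, tower), 2))   # 0.3: much less related
```

Dividing by the lengths is what keeps the score within [-1, 1]; a raw dot product has no such bound unless the vectors are already normalized to unit length.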
When it comes to document-level vectors, the doc2vec model simply adds an extra vector representing the whole document to the training process, using a training algorithm called Distributed Memory to capture the overall meaning of the document. The MPNet model, by contrast, does not internally produce document-level vectors. Instead, we need a pooling function that takes the mean of all the token vectors in a document and treats that “average” vector as the document vector. MPNet’s underlying architecture is much more complex; we will get into the details of its inner workings later on.
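The pooling step can be sketched in a few lines. This is a plain-Python illustration, not the project's code: the `attention_mask` argument is an assumption that reflects how batched transformer inputs are usually padded to a common length, so padding positions should be left out of the average.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Two real tokens followed by one padding position (values are made up).
vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(vectors, mask))  # [2.0, 3.0]
```

Without the mask, the padding vector would drag the average far away from the real tokens, which is why the masked version is the safer default.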
In this project, we experimented with 3 machine-learning algorithm designs: doc2vec (embedding layer) + simple neural network (classifier), MPNet (embedding layer) + simple neural network (classifier), and MPNet (embedding layer + classifier). A classifier is a model that learns from the document vectors and outputs a score for each category; the highest-scoring category is the prediction, which corresponds to a one-hot encoded vector for that category.
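As a small illustration of how a classifier's raw scores relate to a category and its one-hot vector, consider the sketch below. The category names, the scores, and the use of softmax are assumptions for illustration; the source does not spell out the output layer used in this project.

```python
import math

CATEGORIES = ["cs", "math", "physics"]  # hypothetical subset of primary categories


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(index, size):
    """Vector of zeros with a single 1 at the given position."""
    return [1 if i == index else 0 for i in range(size)]


def predict(scores):
    """Return (category name, one-hot vector, probabilities) for raw scores."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CATEGORIES[best], one_hot(best, len(CATEGORIES)), probs


label, vector, probs = predict([0.2, 2.5, -1.0])
print(label, vector)  # math [0, 1, 0]
```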
Except when MPNet is used as an end-to-end classifier, the overall workflow is as follows:
1. Preprocess before splitting the dataset into training and testing datasets
2. Train the embedding models with training data of preprocessed full text
3. Infer or compute embeddings from the trained embedding models using the same training data
4. Feed the inferred embeddings into a simple artificial neural network for training
5. Infer or compute embeddings from the trained embedding models using the test data
6. Evaluate the trained neural network on the inferred embeddings of test data
7. Collect performance metrics for research analysis
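The ordering of these steps can be sketched end to end with toy stand-ins. To keep the example self-contained, a bag-of-words counter stands in for doc2vec/MPNet and a nearest-centroid rule stands in for the neural network; both substitutions, along with every class and function name, are assumptions for illustration only. What the sketch is meant to show is the order of operations: the embedder is fitted on training text only, and test embeddings are inferred afterwards.

```python
import random
from collections import defaultdict


class ToyEmbedder:
    """Stand-in for doc2vec/MPNet: word counts over a training-only vocabulary."""

    def fit(self, token_lists):
        vocab = sorted({tok for toks in token_lists for tok in toks})
        self.index = {tok: i for i, tok in enumerate(vocab)}
        return self

    def infer(self, tokens):
        vec = [0.0] * len(self.index)
        for tok in tokens:
            if tok in self.index:  # words unseen during training are ignored
                vec[self.index[tok]] += 1.0
        return vec


class NearestCentroid:
    """Stand-in for the neural-network classifier."""

    def fit(self, vectors, labels):
        grouped = defaultdict(list)
        for vec, label in zip(vectors, labels):
            grouped[label].append(vec)
        self.centroids = {
            label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in grouped.items()
        }
        return self

    def predict(self, vec):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(vec, centroid))

        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


def run_pipeline(docs, labels, test_fraction=0.25, seed=0):
    # Step 1: preprocess (here just lowercase + split), then split train/test.
    tokens = [doc.lower().split() for doc in docs]
    order = list(range(len(docs)))
    random.Random(seed).shuffle(order)
    n_test = max(1, int(len(order) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    # Step 2: train the embedding model on training text only.
    embedder = ToyEmbedder().fit([tokens[i] for i in train_idx])
    # Steps 3-4: infer training embeddings and train the classifier on them.
    train_vecs = [embedder.infer(tokens[i]) for i in train_idx]
    clf = NearestCentroid().fit(train_vecs, [labels[i] for i in train_idx])
    # Steps 5-6: infer test embeddings and evaluate the trained classifier.
    preds = [clf.predict(embedder.infer(tokens[i])) for i in test_idx]
    # Step 7: collect a performance metric.
    correct = sum(pred == labels[i] for pred, i in zip(preds, test_idx))
    return {"accuracy": correct / n_test, "n_train": len(train_idx), "n_test": n_test}
```

Calling `run_pipeline(docs, labels)` with parallel lists of raw documents and category labels returns a small metrics dictionary. Swapping in real components would mean replacing the two toy classes while keeping the same fit-on-train, infer-on-both structure.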
Doc2vec offers two operations to obtain document embeddings: inference and extraction. One caveat is that we should follow the real-world practice of inferring embeddings, running both train and test data through the trained doc2vec model, rather than extracting (retrieving) stored embeddings from it, even though the extracted ones are the most accurate vector representations of the documents. There are two reasons: inflated performance due to data leakage, and the unrealistic need to re-train in order to incorporate new data. First, to extract the embeddings of the test data, the model has to have seen that data during training. This creates a conflict between convenience and honesty. Testing the model on seen data is like testing a student on recycled exam papers: he or she will do much better simply by rote memorization. It says nothing about the model’s generalizability, which is the core of any predictive model’s capability. So, we split the entire tokenized text, train doc2vec only on the training portion, and then infer embeddings for both the training and testing text to train and test a classifier. Moreover, inference yields vectors for new documents without re-training the embedding model, which is often costly.
Now, let’s talk about MPNet, fine-tuned and used as an end-to-end model. MPNet is a large language model built on a novel pre-training method that combines masked language modeling with permuted language modeling to better learn contextual representations. Because of its transformer-based architecture, MPNet generates context-dependent embeddings for tokens (words or sub-words), which can be pooled or averaged to represent documents. The transformer, a recent invention that gave rise to the current AI boom, includes a component called the encoder, which uses a key technique called self-attention to make its embeddings much richer in semantic, contextual, and positional meaning for larger bodies of text like sentences, paragraphs, or even entire documents. Despite their inability to generate text, encoder-only transformers like BERT or MPNet are generally capable of task-related predictions such as sequence classification and regression once fine-tuned, i.e., once we train a classification head that maps model outputs to the desired number of outcomes while also updating all the weights of the MPNet model. We can think of this step as choosing the spray mode on a multifunction shower head. The water is streamlined and the pressure is strong and ready to deliver. A wide-coverage mode produces an encompassing spray for everyday showering, whereas a targeted mode is a focused stream for cleaning up scum or dried hair in the tub. Through self-attention, our MPNet learns the full text we have trained it on well. After that, fine-tuning updates and adapts the model for various downstream tasks, offering the customizability to suit multiple modeling objectives and needs.
The out-of-sample performances of the proposed models, trained and evaluated on a distributionally balanced dataset, are 52.041% ± 0.167 and 72.267% ± 0.043 average accuracy with doc2vec and MPNet deployed respectively. Both outperform the baseline method, FB Abs, on identical datasets. Furthermore, when MPNet is used as a standalone model, its average accuracy, 72.581% ± 0.089, is on par with that of state-of-the-art embedding methods such as Longformer and SciBERT. This study demonstrates the competitive edge of using MPNet as an embedding layer that captures semantic information about the text in vectors from which a customized ANN model learns for downstream classification tasks. It also bolsters the argument that large language models fine-tuned on in-domain data and adapted to specific tasks often achieve, in practice, superior performance compared to small models trained from scratch.
A high-end Mac Studio with a 24-core CPU, a 60-core GPU, and 128 GB of unified memory was used as the server for this project. Each member needed to SSH into the server via a secure campus VPN at every step of our machine-learning workflow. This manual process for developing ML models comes with a few glitches worth mentioning here. First, it pays to be diligent in saving your work (get used to pressing the “save” hotkey!), especially when your internet connection is spotty. I got burnt at times and, trust me, you won’t like the feeling of losing hours of hard work. Second, the same advice applies for a different reason. Training models on large datasets takes time. A computer going to sleep, an accidental press of the wrong key, or any other mistake could stop the running script and waste time. One solution is to use the nohup command, which lets programs keep running in the background or after the terminal is closed. However, changes made to the script need to be saved before launching it with nohup; otherwise, the changes won’t take effect. Those who are used to writing code in Jupyter notebooks should take note here. On a related note, I think there are tips and tricks we could borrow, or principles we could follow, from the DevOps side to make our processes more efficient and automated over the course of the project.
Next, I want to shed light on some behind-the-scenes hardware-level details that often get glossed over in online articles. Over the course of the project, we invested a good amount of time in setting up a framework that allowed the server to perform compute-intensive tasks using more than just the CPU; without it, simply loading the data into a pandas DataFrame object was not realistically time-efficient, let alone training an LLM. In other words, it is necessary to make use of all GPU resources for every large-scale training task. Since we use a Mac, our choice is Metal, Apple’s in-house API that lets the GPU perform computations in parallel (not merely concurrently). If you use Linux or Windows with an Nvidia GPU, CUDA is the API you would call in your scripts. Long story short, the immediate benefits to us, the modelers, are significantly shorter training times and real-time processing, which is a treat if, like me, you enjoy staring at the terminal watching real-time feedback of training stats. Some might say hardware acceleration also leads to better model performance, i.e., accuracy or precision. We did not test this claim, so we cannot confirm it. As I touched on earlier, we also used multithreading, another trick of the trade, to speed things up, including the download of arXiv PDFs. This technique is well suited to anything I/O-bound, especially if you have a high-bandwidth network connection.
Moving on, let's turn to project assessment. Scope, timeline, expectations, and feasibility are all important aspects of a project. As far as a machine learning project is concerned, I think the data is one of the most significant factors to consider. Its size, availability, and complexity directly influence the success of the project. It is no surprise that most, if not all, machine learning engineers and data scientists spend most of their time cleaning data. Real-world data are messy and tricky to handle. Having a sufficient understanding of the data gives people a clearer sense of the feasibility of a project. Case in point, we had to learn the hard way, in the midst of the project, that the metadata we’d been using was not the version we thought it was. There was a storage version discrepancy that was not communicated online. Additionally, many arXiv categories had, in practice, been subsumed into other categories through an internal arXiv adjustment, yet this policy was not made public. This “mistake” led us to train our models on much noisier data, with unavoidably subpar performance. Luckily, through trial and error, we were able to self-correct and mitigate the issue within a week. Another factor to consider is the project objective, which has to be concrete and clearly defined: a criterion or a numeric value that has the final say in determining success or failure. As the old cliché goes, “without a goal, you cannot score.” Research work can no doubt evolve, but knowing what the goal is helps focus team members' attention on the important work that contributes to achieving it. More importantly, it might help to stop effort being wasted on work that turns out to be irrelevant.
Future developments will involve training and evaluating on a distributionally imbalanced dataset similar to the distribution of real-world arXiv submissions, further exploring MPNet’s capability as an end-to-end classification model, investigating models with encoder-decoder transformer-based architectures like T5 or BART, and applying them to text generation and summarization in different domains like virtual immortality and legacy systems.
We thank Dr. Jonathan Young and Albert Gong for providing experimental datasets and for helpful discussions. This work is supported by the National Science Foundation under grant #2212922 and the Queens College Office of Undergraduate Research. On a personal note, Nick gave me an opportunity for intellectual growth and practical advice to overcome personal adversity, for which I am sincerely grateful. I also thank my collaborators, Daisy and Kathy, for many stimulating conversations and shared laughs.