I have been quietly working on json2vec for the last three years: a novel hierarchical and extensible modeling framework that can cleanly and efficiently embed any JSON-like object for any predictive modeling task with zero feature engineering.

json2vec lets users, for example, build tabular / transactional foundation models like TabBERT / PRAGMA dynamically, just by declaring their data schema. This is a space in which Netflix, Stripe, Revolut, Capital One, Nubank, J.P. Morgan, NVIDIA, etc. have been building for several years.

json2vec goes a step beyond tabular data or structured transactional data. It handles arbitrarily structured "json-like" observations using hierarchical BERT-like transformer encoder blocks. Financial transactions, chess positions, flight itineraries, raw tabular data, rideshare activity, ecommerce, behavioral sequence models... Any raw data that can be represented as `json` can be encoded into a tree of embeddings and used for downstream finetuning in supervised machine learning... No feature engineering required.
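
To make the "tree of embeddings" idea concrete, here is a toy sketch in plain Python. It is not json2vec's API or architecture: the `embed` function, the hashing trick, and the mean pooling are all assumptions made for illustration. Mean pooling stands in for the transformer encoder blocks, and hashed vectors stand in for learned embeddings.

```python
import hashlib
import math

DIM = 8  # toy embedding width, chosen arbitrarily for this sketch


def _hash_vector(token: str) -> list[float]:
    """Deterministic stand-in for a learned embedding: DIM floats in [-1, 1]."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [digest[i] / 127.5 - 1.0 for i in range(DIM)]


def embed(node, path: str = "$") -> list[float]:
    """Recursively embed a JSON-like value into a single DIM-sized vector."""
    if isinstance(node, dict):
        children = [embed(value, f"{path}.{key}") for key, value in node.items()]
    elif isinstance(node, list):
        # List items share one path, so this toy version ignores item order.
        children = [embed(item, f"{path}[]") for item in node]
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        # Numbers: scale the field's vector by a log-squashed magnitude.
        scale = math.copysign(math.log1p(abs(node)), node)
        return [scale * x for x in _hash_vector(path)]
    else:
        # Strings, booleans, and None are treated as categories of the field.
        return _hash_vector(f"{path}={node!r}")

    if not children:  # empty object or array
        return _hash_vector(f"{path}=<empty>")
    # Pool the child vectors into the parent (a real model would use attention).
    return [sum(column) / len(children) for column in zip(*children)]


account = {
    "country": "US",
    "transactions": [
        {"amount": 12.5, "merchant": "coffee"},
        {"amount": 900.0, "merchant": "airline"},
    ],
}
account_vector = embed(account)  # one vector summarizing the whole tree
```

Each nested object or array gets its own vector on the way up, so a downstream head could read from the account level or from each individual transaction.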

https://github.com/granthamtaylor/json2vec

json2vec has a plugin system for data types. The built-in types cover numbers, categories, raw text, datetimes, hashable objects [think: IP addresses and phone numbers], and raw embeddings, all of which may be pretrained via MLM-like self-supervised learning. If the built-in datatypes do not meet your needs, you may build your own custom datatypes (think: geographical coordinates). Built-in decision heads for a subset of datatypes enable multi-task predictive modeling with multi-array outputs (for example, predicting fraud at a per-transaction level or a per-account level).
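
For readers unfamiliar with MLM-like pretraining on structured records, the masking step can be sketched as below. This is a generic illustration, not json2vec's implementation: `mask_fields`, the `MASK` sentinel, and the masking rate are hypothetical names and values chosen for the example.

```python
import random

MASK = "<mask>"  # hypothetical sentinel that replaces a hidden field value


def mask_fields(record: dict, rate: float, rng: random.Random) -> tuple[dict, dict]:
    """Hide a random subset of fields and return (masked_record, targets).

    A model would be trained to reconstruct `targets` from `masked_record`,
    which is the self-supervised signal in masked-language-model style training.
    """
    masked, targets = {}, {}
    for key, value in record.items():
        if rng.random() < rate:
            masked[key] = MASK
            targets[key] = value  # ground truth the model must predict
        else:
            masked[key] = value
    return masked, targets


transaction = {"amount": 42.0, "merchant": "coffee", "currency": "USD"}
masked, targets = mask_fields(transaction, rate=0.3, rng=random.Random(0))
```

Because the targets come from the data itself, no labels are needed for this stage; labeled tasks such as fraud prediction would come later, during finetuning.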

json2vec also provides built-in data pipelines for 100B+ training observations streaming from cloud storage. These pipelines integrate with a layer of programmatic data querying and UDFs that can absorb the vast majority of upstream data processing, so developers don't waste time on massive batch data preprocessing jobs before model training.

Oh, and the best part: the model architectures instantiated by json2vec are mutable. Model developers can add and remove features and targets at will, allowing for truly reusable foundation models that can adapt to each individual use case.

My hope is that with a standardized hierarchical modeling framework, interested organizations can better collaborate with one another by sharing reusable logic instead of hardcoding use-case-specific architectures.