add an ML Model type to schema.org, likely under /Dataset, ... and potentially other major types (SingleTabularDataset, MultipleTabularDataset?) #3140

danbri · 2022-07-20T18:13:07Z

Dataset is pretty vague, it can cover anything from .zip files of .wavs of social science interviews, application-specific on-disk file formats, etc etc. In theory we could have subtypes for all of these, but it would be a neverending task, so we have tended to address such needs through indirect mechanisms like https://schema.org/fileFormat

The practice of publishing and consuming ML models is different, and it is worth exploring a special subtype, to allow consuming services to differentiate machine learning models from other unrelated things. This is partially to do with the massive attention associated with "ML", "AI" etc., but also relates to the relatively inscrutable nature of trained ML models. You get something that can map inputs to outputs or act as a controller of an agent of some kind, ... but how it works internally is relatively unclear. This is in stark contrast to the most stereotypical kinds of Datasetty Dataset, i.e. where you have the rows and columns of descriptive tables, and some reasonable clarity on how the parts of the dataset relate to the whole.

In many ways machine learning models are more like software artifacts than they are like (for example) "open data" CSV files. Historically Schema.org hasn't included https://schema.org/SoftwareSourceCode or https://schema.org/SoftwareApplication as subtypes of /Dataset, ... although a case could be made to do so. Similarly Docker (etc.) packaging, or Debian/CPAN/NPM etc packages are somewhat related.

This issue comes out of discussions (with colleagues and in the wider world) about the scope of "Dataset", and the concern that it will just be super confusing to try to persuade ML folk to mark up models as datasets, whereas in their preferred terminology, models are built from datasets.

So -

How about we add (as an exploratory, "Pending" term) a type: MLModel

Description: A schema.org MLModel represents a machine learning model. These will typically but not necessarily consist of some data describing the general setup and architecture of the model, and a larger set of parameters. The specific format of the model can be indicated via fileFormat, and might be proprietary, toolkit specific, or adhere to some published standard.
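As a rough illustration of the description above, here is a minimal sketch of how a publisher might emit such markup as JSON-LD from Python. `MLModel` is the type proposed in this issue, not an existing schema.org term. The model name, the example.org URLs, and the choice of `isBasedOn` to point at training data are assumptions made purely for illustration.

```python
import json


def ml_model_jsonld(name, url, file_format, training_dataset_url):
    """Build a JSON-LD dict for a hypothetical schema.org MLModel."""
    return {
        "@context": "https://schema.org",
        "@type": "MLModel",  # proposed type, not yet in schema.org
        "name": name,
        "url": url,
        "fileFormat": file_format,  # existing CreativeWork property
        # Assumption: reuse the existing isBasedOn property for training data.
        "isBasedOn": {"@type": "Dataset", "url": training_dataset_url},
    }


doc = ml_model_jsonld(
    name="Example sentiment classifier",         # hypothetical model
    url="https://example.org/models/sentiment",  # hypothetical URL
    file_format="application/octet-stream",
    training_dataset_url="https://example.org/datasets/reviews",
)
print(json.dumps(doc, indent=2))
```

A consuming service could then tell models apart from other datasets by checking `@type`, which is the differentiation the proposal is after.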

Adding an MLModel type would allow publishers to distinguish models from other datasets more clearly, and in a way that could be used by consuming applications. Recently there has been an increasing recognition of the risks around bias and inappropriate re-use of pre-trained machine learning models, as well as a concern that the high cost of training certain kinds of model makes it important for such models to be shareable, lowering barriers to entry of the ML/AI field. In this climate, it is ideal when data publishers provide not only the information describing a pre-trained machine learning model they're offering, but also the dataset(s) upon which the model was trained. Calling both of these published artifacts "Dataset" may be confusing, ...despite ML models fitting under Schema.org's pretty broad notion of Dataset. My proposal here partially resolves this by adding a convenience type. For those who care to dig they might find out that MLModel is considered a special kind of Dataset, but most won't even encounter this information.

There's more to say on this but I've written too much already, and wanted to start some discussion. Opinions welcomed!

HughP · 2022-07-20T18:55:46Z

I concur that dataset in schema.org is vague. And that an MLModel is not a dataset, but an abstraction of patterns found in a data stimulus. The term dataset is also hard to understand (or often misunderstood) in the Dublin Core world. From an archivist's perspective (even a digital archivist's perspective), I'd like to point out there is a difference between a collection and a dataset. A collection is a set of items (files) associated for some reason, whereas a dataset is curated, arranged, and designed to be computer read as a coherent data structure. This applies to things like a collection of .wav files. Rarely in academic preservation contexts are .wav files a dataset, though it is possible to write code to find structure in anything. Ideally, a dataset would conform to an arrangement specification.

egonw · 2022-07-22T18:02:13Z

A Dataset is something other than a machine-learned model. I also agree there is something missing in schema.org for 'model'. I would suggest, however, introducing Model instead of the more specific MLModel. Or perhaps both, with MLModel subclassing Model. A Model is broader and can be a 3D model of a molecule (which is a fit of real-world data) or a Model describing some phenomenon, like a pathway model.

danbri · 2022-07-22T18:07:23Z

We also have

https://schema.org/ProductModel

https://schema.org/3DModel

Latter is close but defined as a MediaObject

egonw · 2022-07-22T18:14:37Z

Ah, nice! 3DModel will work well for chemistry. I still like to see that Model superclass :)

julien-c · 2022-08-10T09:50:49Z

Hi!

I build @huggingface, a platform where the machine learning community can host and collaborate on machine learning models and datasets.

I support the addition of a MLModel type which we could adopt on our model pages e.g. https://huggingface.co/bigscience/bloom

Potential data we could encode in a type (some of which we already validate in a structured way on our side):

- datasets used to train or fine-tune the model, as mentioned by @danbri
- eval results, i.e. metrics achieved on the datasets used in training (e.g. "this model achieves .85 F1-score on this document classification dataset")
- file formats of the weights files, or (related) compatibility to machine learning frameworks like tensorflow or pytorch
- compatibility with which machine learning library (e.g. transformers, asteroid, spaCy, timm, etc.)
- examples of inputs/outputs
- modality (text, audio, images, etc.)
- (for NLP models) supported languages
- carbon emissions generated when training and/or when used in inference
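To make the list above a little more concrete, here is a hedged sketch of how such metadata might be split into what existing schema.org properties could plausibly carry and what would need new vocabulary. The field names on the left are hypothetical (they are not Hugging Face's actual metadata keys), `MLModel` is the proposed type, and the mapping to `isBasedOn`, `fileFormat`, and `inLanguage` is one possible reading, not an agreed design.

```python
# Assumed mapping from model-card style fields (hypothetical names) to
# existing schema.org properties. None means no obvious existing property.
FIELD_TO_PROPERTY = {
    "datasets": "isBasedOn",
    "weights_format": "fileFormat",
    "languages": "inLanguage",
    "eval_results": None,
    "library": None,
    "io_examples": None,
    "modality": None,
    "co2_emissions": None,
}


def to_schema_org(card):
    """Split a model-card dict into JSON-LD markup and unmapped field names."""
    markup = {"@context": "https://schema.org", "@type": "MLModel"}
    unmapped = []
    for field, value in card.items():
        prop = FIELD_TO_PROPERTY.get(field)
        if prop is None:
            unmapped.append(field)  # would need new vocabulary
        else:
            markup[prop] = value
    return markup, sorted(unmapped)


card = {  # hypothetical model card
    "datasets": ["https://example.org/datasets/reviews"],
    "weights_format": "application/octet-stream",
    "languages": ["en", "fr"],
    "modality": "text",
    "eval_results": {"f1": 0.85},
}
markup, unmapped = to_schema_org(card)
print(markup)
print("No existing property for:", unmapped)
```

The unmapped list is the interesting part: it marks where the discussion about new properties would have to start.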

danbri · 2022-08-10T17:05:56Z

@julien-c that is great! Thanks for jumping in. Some of that ought to be expressible already. E.g. we have fileFormat, and we already have SoftwareSourceCode and SoftwareApplication, the former giving us the runtimePlatform and codeRepository properties.

For the more advanced information listed here I'd hope to identify some effort to use/consume that data, and find cases where multiple model-hosting sites have the same information.

My colleague Natasha Noy and I have been debating whether it is counter-productive to call ML models "datasets" in some general sense, and how to deal with this. We agree that in the ML context that is confusing terminology, whereas Schema.org has a much more expansive notion of "dataset" (covering non-quantitative research artifacts etc.), so it is awkward to exclude. One option here could be some kind of analogy with the way we (and DCAT) represent dataset distribution as "DataDownload", i.e. very specific concrete bundles of bytes, represented in the MediaObject type hierarchy (alongside ImageObject, VideoObject). If we were to put MLModel there it would need some convention/property for linking it from Dataset. Different versions (e.g. smaller for mobile on-device use, or older, ...) of a model would be different MLModel media objects, whereas if we stick with the idea of MLModel as a subtype of Dataset then a single dataset could be an umbrella construct that referenced such versions as DataDownloads.

I believe that either MLModel as a subtype of Dataset, or MLModel as a subtype of MediaObject, ... could be made to work.
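The two options can be sketched side by side as JSON-LD built in Python. This is only an illustration under stated assumptions: `MLModel` is the proposed type, the example.org URLs and version names are made up, and `hasPart` in option B is merely a stand-in for the linking convention/property that the comment above says would still need to be chosen. `distribution`, `DataDownload`, and `contentUrl` are existing schema.org terms.

```python
import json

BASE = "https://example.org/models/"  # hypothetical download location

# Option A: MLModel as a subtype of Dataset. One umbrella node; each concrete
# version is a DataDownload linked via the existing distribution property.
option_a = {
    "@context": "https://schema.org",
    "@type": "MLModel",  # proposed type
    "name": "Example model",
    "distribution": [
        {"@type": "DataDownload", "name": "full", "contentUrl": BASE + "full.bin"},
        {"@type": "DataDownload", "name": "mobile", "contentUrl": BASE + "mobile.bin"},
    ],
}

# Option B: MLModel as a subtype of MediaObject. Each version is its own
# MLModel node, linked from an umbrella Dataset. hasPart is an assumption,
# standing in for whatever linking property would be agreed on.
option_b = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example model",
    "hasPart": [
        {"@type": "MLModel", "name": "full", "contentUrl": BASE + "full.bin"},
        {"@type": "MLModel", "name": "mobile", "contentUrl": BASE + "mobile.bin"},
    ],
}


def version_urls(doc, link_property):
    """Collect the download URL of every version reachable via link_property."""
    return [node["contentUrl"] for node in doc[link_property]]


print(json.dumps(option_a, indent=2))
print(json.dumps(option_b, indent=2))
```

Both shapes carry the same concrete files; the difference is whether "MLModel" names the umbrella or the individual bundles of bytes.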

I suggest we proceed by putting something basic into the next release of Schema.org as a foundation to explore from...

danbri · 2022-08-11T17:24:29Z

A twist on this design:

- MLModel is just another subtype of CreativeWork
- Similarly to the situation we have with things that are usefully described as a Book and also as a Product, anyone who is describing an MLModel that they think is usefully considered a "Dataset" can simply use multiple typing.
- While there is an expansive view of the notion of a "dataset" that encompasses ML models, pretty much anything digital can be handled as a dataset if the situation arises. This is comparable to "Product" being anything at all offered for sale, or art being anything you can get hung on a wall in an art gallery, etc.
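In JSON-LD, multiple typing simply means giving `@type` a list of types. A minimal sketch, assuming the proposed `MLModel` type exists and using a hypothetical helper named `add_type`:

```python
import json


def add_type(doc, extra_type):
    """Return a copy of a JSON-LD node with an additional @type."""
    types = doc.get("@type", [])
    if isinstance(types, str):
        types = [types]  # normalise a single type to a list
    if extra_type not in types:
        types = types + [extra_type]  # new list; the input is not mutated
    return {**doc, "@type": types}


model = {
    "@context": "https://schema.org",
    "@type": "MLModel",  # proposed type
    "name": "Example model",  # hypothetical model
}

# A publisher who also considers the model a dataset opts in explicitly.
model_and_dataset = add_type(model, "Dataset")
print(json.dumps(model_and_dataset, indent=2))
```

Under this design nothing forces the Dataset reading on publishers who find it confusing; they just leave the second type off.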

It is important for schema changes to be driven by a commitment from someone to build user-facing features that use the proposed vocabulary at scale. Without this we tend to go in circles, lacking guidance. I have spoken with colleagues at Google who are responsible for Dataset Search, and they see some potential for this to be useful by enriching the information available about a dataset, for example by indicating which MLModels have been trained on that dataset.

For the case of supervised learning trained on a dataset, the path looks very straightforward. I would like to take a moment to look at situations beyond that, e.g. Reinforcement Learning, where the model is trained against runs of a simulation of some kind. For example https://huggingface.co/format37/BipedalWalker-v3 is using OpenAI Gym in python, which has dependencies on Numpy, Box2D etc. How do MLModels indicate non-data environments critical to their creation? How precise is appropriate here, ... docker images, records of random seeds etc.? Are we running into situations that would be better handled separately for all software packages / systems?

HughP · 2022-08-11T18:09:19Z

I’m still struggling to see MLModels as datasets any more than an MS Word document is a Dataset. In my work with ML I’ve always seen these as an abstraction of a dataset (the set of data used to train the model). In this sense, the training data and the MLModel are independent creative works. Can someone link me to an explanation of how the thing-ness of an MLModel qualifies as a dataset? Respectfully, - Hugh


mprorock · 2022-10-04T17:20:21Z

ML has a lot of assets that are really interesting and potentially useful to have better defined in schema.org: datasets for sure, and the ability to set types around datasets in a better manner. The comments from Hugging Face above are in line with our thoughts. Vectors and weights are also highly useful. With models, model metadata may be very valuable, e.g. model purpose, known biases, links to training data, etc.

github-actions · 2023-01-03T02:24:41Z

This issue is being nudged due to inactivity.

VladimirAlexiev · 2023-02-08T10:36:44Z

We're working on https://github.com/enRichMyData/InnoGraph, a KG of the AI innovation ecosystem, and ML Models are one of the kinds of entities that we are after.
I welcome the addition of such a class to schema.org.

For me Dataset means some machine-processable data that you can download.

It's one of the 4 main types of results being tracked in Science KGs: publications, patents, datasets, software.
(In CERIF you also track research facilities and products).
DCAT makes it quite clear what's a Dataset, Distribution, etc.
So @HughP: for me, MLModel is a subclass of Dataset. Yes, it has specialized provenance/dependencies as explained by @julien-c and tracked in @huggingface, and "data" is a key input in producing a MLModel. But that doesn't mean the MLModel itself isn't "data" (a Dataset)
@egonw: what is the common information between 3DModel and MLModel that you'd put in a superclass Model?

There's a lot of things that can be tracked about ML, as explained by @julien-c and as exemplified eg at

Which of these attributes should we elaborate about MLModel? Pinged the above orgs in https://twitter.com/valexiev1/status/1623270940258758656

egonw · 2023-02-10T20:22:14Z

what is the common information between 3DModel and MLModel that you'd put in a superclass Model?

the data it was modeled or derived from
something it represents or predicts

VladimirAlexiev · 2023-02-26T12:59:42Z

@EmidioStani through twitter:
https://semiceu.github.io/MLDCAT-AP/releases/1.0.0/ "developed in collaboration with OpenML".

Based on this UML model:

Emidio, who is your OpenML contact, is it @joaquinvanschoren?
I got some notes on OpenML RDF metadata, and the Expose ontology
- http://www.openml.org/downloads/expose.owl (I got some critiques, if they will be of interest)
  - https://www.slideshare.net/JoaquinVanschoren/expos-ontology (91 slides, 20 Jan 2019)
- https://github.com/ML-Schema, https://github.com/ML-Schema/core
- OpenML metadata uses some OML ontology, neither ml-schema nor Expose
@julien-c can you comment?

In addition to the above-mentioned, I got some notes about these, in case we want to go even deeper:

BKEF, SEWEBAR, EasyMiner.eu- explainability
e-LICO, DMEX, DMOP, DMKB
EDAM
FAIRNets
MEX, METArchive
ML-Schema
OntoDM, KDD
RapidMiner

And I have this "ML interop levels" diagram, not sure from where:

VladimirAlexiev · 2023-02-26T14:28:29Z

@EmidioStani @joaquinvanschoren
I posted a bunch of issues at https://github.com/SEMICeu/MLDCAT-AP/issues.
The most important is SEMICeu/MLDCAT-AP#15: can OML describe ML over semantic datasets?

VladimirAlexiev · 2023-06-22T10:50:40Z

what is the common information between 3DModel and MLModel that you'd put in a superclass Model?
@egonw: the data it was modeled or derived from; something it represents or predicts

A 3DModel usually represents a real-world object (natural, designed, or built). It's often "digitally born" (eg in a CAD system), or could be derived eg from a point-cloud.
An MLModel predicts input/output relations. It's almost always based on some training data, which can be anything in nature (entities+attributes, numeric series, text, speech, images, video).

I don't see anything in common apart from the word "Model". Given that schema.org has refused to create class Agent (union of Person+Organization), the chance it will agree to create class Model is zero.

danbri self-assigned this Jul 20, 2022

danbri added the Queued for Editorial Work Editor needs to turn issues/PRs into final code and release notes. label Aug 16, 2022

github-actions bot added the no-issue-activity Discuss has gone quiet. Auto-tagging to encourage people to re-engage with the issue (or close it!). label Jan 3, 2023

VladimirAlexiev mentioned this issue Dec 20, 2023

Integration of sparql with large language model related functionality w3c/sparql-dev#193

Open

VladimirAlexiev mentioned this issue May 28, 2024

Support for Vectors and Matrices w3c/sparql-dev#163

Open
