add an ML Model type to schema.org, likely under /Dataset, ... and potentially other major types (SingleTabularDataset, MultipleTabularDataset?) #3140

Open
danbri opened this issue Jul 20, 2022 · 15 comments
Assignees
Labels
no-issue-activity Discuss has gone quiet. Auto-tagging to encourage people to re-engage with the issue (or close it!). Queued for Editorial Work Editor needs to turn issues/PRs into final code and release notes.

Comments

@danbri
Contributor
danbri commented Jul 20, 2022

Dataset is pretty vague, it can cover anything from .zip files of .wavs of social science interviews, application-specific on-disk file formats, etc etc. In theory we could have subtypes for all of these, but it would be a neverending task, so we have tended to address such needs through indirect mechanisms like https://schema.org/fileFormat

The practice of publishing and consuming ML models is different, and it is worth exploring a special subtype to allow consuming services to differentiate machine learning models from other unrelated things. This is partially to do with the massive attention associated with "ML", "AI" etc., but also relates to the relatively inscrutable nature of trained ML models. You get something that can map inputs to outputs or act as a controller of an agent of some kind, ... but how it works internally is relatively unclear. This is in stark contrast to the most stereotypical kinds of Datasetty Dataset, i.e. where you have the rows and columns of descriptive tables, and some reasonable clarity on how the parts of the dataset relate to the whole.

In many ways machine learning models are more like software artifacts than they are like (for example) "open data" CSV files. Historically Schema.org hasn't included https://schema.org/SoftwareSourceCode or https://schema.org/SoftwareApplication as subtypes of /Dataset, ... although a case could be made to do so. Similarly Docker (etc.) packaging, or Debian/CPAN/NPM etc packages are somewhat related.

This issue comes out of discussions (with colleagues and in the wider world) about the scope of "Dataset", and the concern that it will just be super confusing to try to persuade ML folk to mark up models as datasets, whereas in their preferred terminology, models are built from datasets.

So -

How about we add (as an exploratory, "Pending" term) a type: MLModel

  • Description: A schema.org MLModel represents a machine learning model. These will typically but not necessarily consist of some data describing the general setup and architecture of the model, and a larger set of parameters. The specific format of the model can be indicated via fileFormat, and might be proprietary, toolkit specific, or adhere to some published standard.

Adding an MLModel type would allow publishers to distinguish models from other datasets more clearly, and in a way that could be used by consuming applications. Recently there has been increasing recognition of the risks around bias and inappropriate re-use of pre-trained machine learning models, as well as a concern that the high cost of training certain kinds of model makes it important for such models to be shareable, lowering barriers to entry to the ML/AI field. In this climate, it is ideal when data publishers provide not only the information describing a pre-trained machine learning model they're offering, but also the dataset(s) upon which the model was trained. Calling both of these published artifacts "Dataset" may be confusing, despite ML models fitting under Schema.org's pretty broad notion of Dataset. My proposal here partially resolves this by adding a convenience type. Those who care to dig might find out that MLModel is considered a special kind of Dataset, but most won't even encounter this information.
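
To make the proposal concrete, here is a minimal sketch of what such markup might look like, built in Python and serialized as JSON-LD. MLModel is the type proposed in this issue and is not part of Schema.org; the names, URLs and format string are made up for illustration, and using isBasedOn to point at the training dataset is only one possible convention, not something agreed in this thread.

```python
import json


def describe_model(name, model_url, file_format, training_dataset_url):
    """Return a JSON-LD dict describing a model and the dataset it was trained on."""
    return {
        "@context": "https://schema.org",
        # "MLModel" is the type proposed in this issue; it is hypothetical.
        "@type": "MLModel",
        "name": name,
        "url": model_url,
        # fileFormat is the property mentioned above for indicating the format.
        "fileFormat": file_format,
        # Assumption: isBasedOn is used to link the model to its training data.
        "isBasedOn": {
            "@type": "Dataset",
            "url": training_dataset_url,
        },
    }


# All values below are placeholders, not real models or datasets.
markup = describe_model(
    name="Example sentiment classifier",
    model_url="https://example.org/models/sentiment",
    file_format="application/x-example-weights",
    training_dataset_url="https://example.org/datasets/reviews",
)
print(json.dumps(markup, indent=2))
```

A consumer could then tell the two artifacts apart by checking @type, while still following the link from the model to the dataset it was built from.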

There's more to say on this but I've written too much already, and wanted to start some discussion. Opinions welcomed!

@danbri danbri self-assigned this Jul 20, 2022
@HughP
HughP commented Jul 20, 2022

I concur that dataset in schema.org is vague, and that an MLModel is not a dataset but an abstraction of patterns found in a data stimulus. The term dataset is also hard to understand (or often misunderstood) in the Dublin Core world. From an archivist's perspective (even a digital archivist's perspective), I'd like to point out there is a difference between a collection and a dataset. A collection is a set of items (files) associated for some reason, whereas a dataset is curated, arranged, and designed to be read by computer as a coherent data structure. This applies to things like a collection of .wav files: rarely in academic preservation contexts are .wav files a dataset, though it is possible to write code to find structure in anything. Ideally, a dataset would conform to an arrangement specification.

@egonw
Contributor
egonw commented Jul 22, 2022

A Dataset is something other than a machine-learned model. I also agree there is something missing in schema.org for 'model'. I would suggest, however, introducing Model instead of the more specific MLModel. Or perhaps both, with MLModel subclassing Model. A Model is broader and can be a 3D model of a molecule (which is a fit of real-world data) or a Model describing some phenomenon, like a pathway model.

@danbri
Contributor Author
danbri commented Jul 22, 2022

We also have

https://schema.org/ProductModel

https://schema.org/3DModel

The latter is close, but it is defined as a MediaObject.

@egonw
Contributor
egonw commented Jul 22, 2022

Ah, nice! 3DModel will work well for chemistry. I still like to see that Model superclass :)

@julien-c

Hi!

I build @huggingface, a platform where the machine learning community can host and collaborate on machine learning models and datasets.

I support the addition of a MLModel type which we could adopt on our model pages e.g. https://huggingface.co/bigscience/bloom

Potential data we could encode in a type (some of which we already validate in a structured way on our side):

  • datasets used to train or fine-tune the model, as mentioned by @danbri
  • eval results, i.e. metrics achieved on the datasets used in training (e.g. "this model achieves .85 F1-score on this document classification dataset")
  • file formats of the weights files, or (related) compatibility to machine learning frameworks like tensorflow or pytorch
  • compatibility with which machine learning library (e.g. transformers, asteroid, spaCy, timm, etc.)
  • examples of inputs/outputs
  • modality (text, audio, images, etc.)
  • (for NLP models) supported languages
  • carbon emissions generated when training and/or when used in inference
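
As a rough illustration of how the items in this list could be held in one structured record, here is a small Python sketch. Every class and field name below is hypothetical: none of them are Schema.org terms or part of any Hugging Face API, and the sample values are placeholders (the 0.85 F1 figure simply reuses the example wording above).

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class EvalResult:
    dataset: str  # name or URL of the dataset the metric was computed on
    metric: str   # e.g. "f1"
    value: float


@dataclass
class ModelMetadata:
    """Hypothetical record mirroring the bullet list above."""
    name: str
    training_datasets: List[str] = field(default_factory=list)
    eval_results: List[EvalResult] = field(default_factory=list)
    weight_formats: List[str] = field(default_factory=list)
    libraries: List[str] = field(default_factory=list)
    example_inputs: List[str] = field(default_factory=list)
    modality: str = ""
    languages: List[str] = field(default_factory=list)  # for NLP models
    training_co2_kg: Optional[float] = None  # None when not reported


# Placeholder values only; this does not describe any real model.
meta = ModelMetadata(
    name="example-org/doc-classifier",
    training_datasets=["https://example.org/datasets/documents"],
    eval_results=[EvalResult("https://example.org/datasets/documents", "f1", 0.85)],
    weight_formats=["pytorch"],
    libraries=["transformers"],
    example_inputs=["An invoice for office supplies"],
    modality="text",
    languages=["en"],
)

record = asdict(meta)  # nested dataclasses become plain dicts
print(record["eval_results"])
```

Which of these fields deserve a shared vocabulary term, and which are better left site-specific, is exactly the open question in this thread.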

@danbri
Contributor Author
danbri commented Aug 10, 2022

@julien-c that is great! Thanks for jumping in. Some of that ought to be expressible already. E.g. we have fileFormat, and we already have SoftwareSourceCode and SoftwareApplication, the former giving us the runtimePlatform and codeRepository properties.

For the more advanced information listed here I'd hope to identify some effort to use/consume that data, and find cases where multiple model-hosting sites have the same information.

My colleague Natasha Noy and I have been debating whether it is counter-productive to call ML models "datasets" in some general sense, and how to deal with this. We agree that in the ML context that is confusing terminology, whereas Schema.org has a much more expansive notion of "dataset" (covering non-quantitative research artifacts etc.), so it is awkward to exclude. One option here could be some kind of analogy with the way we (and DCAT) represent a dataset distribution as "DataDownload", i.e. very specific concrete bundles of bytes, represented in the MediaObject type hierarchy (alongside ImageObject, VideoObject). If we were to put MLModel there it would need some convention/property for linking it from Dataset. Different versions of a model (e.g. smaller for mobile on-device use, or older, ...) would be different MLModel media objects, whereas if we stick with the idea of MLModel as a subtype of Dataset then a single dataset could be an umbrella construct that referenced such versions as DataDownloads.

I believe that either MLModel as a subtype of Dataset, or MLModel as a subtype of MediaObject, ... could be made to work.
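
The "umbrella" reading can be sketched with terms Schema.org already has: a Dataset whose distribution lists one DataDownload per concrete variant. This is only a sketch of that second option; the helper name, variant labels, URLs and format strings are invented for illustration, and the MediaObject alternative is not shown because the linking property it would need has not been defined.

```python
import json


def umbrella_dataset(name, variants):
    """Build a Dataset whose distributions are the concrete model files.

    variants: iterable of (label, content_url, encoding_format) tuples.
    """
    return {
        "@context": "https://schema.org",
        # Plain Dataset here; under the proposal this could be the
        # hypothetical MLModel subtype instead.
        "@type": "Dataset",
        "name": name,
        "distribution": [
            {
                "@type": "DataDownload",
                "name": label,
                "contentUrl": url,
                "encodingFormat": fmt,
            }
            for label, url, fmt in variants
        ],
    }


# Placeholder variants: a full-size file and a smaller on-device one.
doc = umbrella_dataset(
    "Example speech model",
    [
        ("full", "https://example.org/models/speech/full.bin", "application/octet-stream"),
        ("mobile", "https://example.org/models/speech/mobile.bin", "application/octet-stream"),
    ],
)
print(json.dumps(doc, indent=2))
```

With this shape, a consumer sees one described thing with several downloadable bundles of bytes, rather than several unrelated media objects.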

I suggest we proceed by putting something basic into the next release of Schema.org as a foundation to explore from...

@danbri
Contributor Author
danbri commented Aug 11, 2022

A twist on this design:

  • MLModel is just another subtype of CreativeWork
  • Similarly to the situation we have with things that are usefully described as a Book and also as a Product, anyone who is describing an MLModel that they think is usefully considered a "Dataset" can simply use multiple typing.
  • While there is an expansive view of the notion of a "dataset" that encompasses ML models, pretty much anything digital can be handled as a dataset if the situation arises. This is comparable to "Product" being anything at all offered for sale, or art being anything you can get hung on a wall in an art gallery, etc.
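
A small sketch of what multiple typing looks like in JSON-LD, written as Python dicts. In JSON-LD, @type may be a single string or a list, so a consumer has to handle both. MLModel is still the hypothetical type from this issue, and the helper name and sample names are made up.

```python
def types_of(node):
    """Return the set of types on a JSON-LD node (@type may be a string or a list)."""
    declared = node.get("@type", [])
    return {declared} if isinstance(declared, str) else set(declared)


# The existing pattern: something described as both a Book and a Product.
book = {"@context": "https://schema.org", "@type": ["Book", "Product"], "name": "Example book"}

# The same pattern applied here; "MLModel" is the proposed, hypothetical type.
model_as_dataset = {
    "@context": "https://schema.org",
    "@type": ["MLModel", "Dataset"],
    "name": "Example model",
}

# A publisher who does not consider the model a dataset just omits the second type.
model_only = {"@context": "https://schema.org", "@type": "MLModel", "name": "Example model"}

for node in (book, model_as_dataset, model_only):
    print(node["name"], "is a Dataset:", "Dataset" in types_of(node))
```

Under this design, a dataset-oriented consumer picks up only the models whose publishers opted in, and nothing in the type hierarchy forces the issue.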

It is important for schema changes to be driven by commitments from someone to build user-facing features that use the proposed vocabulary at scale. Without this we tend to go in circles, lacking guidance. I have spoken with colleagues at Google who are responsible for Dataset Search, and they see some potential for this to be useful by enriching the information available about a dataset, for example by indicating which MLModels out there have been trained on that dataset.

For the case of supervised learning trained on a dataset, the path looks very straightforward. I would like to take a moment to look at situations beyond that, e.g. Reinforcement Learning, where the model is trained against runs of a simulation of some kind. For example https://huggingface.co/format37/BipedalWalker-v3 is using OpenAI Gym in python, which has dependencies on Numpy, Box2D etc. How do MLModels indicate non-data environments critical to their creation? How precise is appropriate here, ... docker images, records of random seeds etc.? Are we running into situations that would be better handled separately for all software packages / systems?

@HughP
HughP commented Aug 11, 2022 via email

@danbri danbri added the Queued for Editorial Work Editor needs to turn issues/PRs into final code and release notes. label Aug 16, 2022
@mprorock
mprorock commented Oct 4, 2022

ML has a lot of assets that are really interesting and potentially useful to have better defined in schema.org: datasets for sure, and the ability to set types around datasets in a better manner. The comments from Hugging Face above are in line with our thoughts. Vectors and weights are also highly useful. For models, metadata such as model purpose, known biases, and links to training data could all be very valuable.

@github-actions
github-actions bot commented Jan 3, 2023

This issue is being nudged due to inactivity.

@github-actions github-actions bot added the no-issue-activity Discuss has gone quiet. Auto-tagging to encourage people to re-engage with the issue (or close it!). label Jan 3, 2023
@VladimirAlexiev
VladimirAlexiev commented Feb 8, 2023

We're working on https://github.com/enRichMyData/InnoGraph, a KG of the AI innovation ecosystem, and ML Models are one of the kinds of entities that we are after.
I welcome the addition of such a class to schema.org.

For me Dataset means some machine-processable data that you can download.

  • It's one of the 4 main types of results being tracked in Science KGs: publications, patents, datasets, software.
    (In CERIF you also track research facilities and products).
  • DCAT makes it quite clear what's a Dataset, Distribution, etc
  • So @HughP: for me, MLModel is a subclass of Dataset. Yes, it has specialized provenance/dependencies as explained by @julien-c and tracked in @huggingface, and "data" is a key input in producing a MLModel. But that doesn't mean the MLModel itself isn't "data" (a Dataset)
  • @egonw: what is the common information between 3DModel and MLModel that you'd put in a superclass Model?

There's a lot of things that can be tracked about ML, as explained by @julien-c and as exemplified eg at

Which of these attributes should we elaborate about MLModel? Pinged the above orgs in https://twitter.com/valexiev1/status/1623270940258758656

@egonw
Contributor
egonw commented Feb 10, 2023

what is the common information between 3DModel and MLModel that you'd put in a superclass Model?

  • the data it was modeled or derived from
  • something it represents or predicts

@VladimirAlexiev
VladimirAlexiev commented Feb 26, 2023

@EmidioStani through twitter:
https://semiceu.github.io/MLDCAT-AP/releases/1.0.0/ "developed in collaboration with OpenML".

  • Based on this UML model:


In addition to the above-mentioned, I got some notes about these, in case we want to go even deeper:

  • BKEF, SEWEBAR, EasyMiner.eu (explainability)
  • e-LICO, DMEX, DMOP, DMKB
  • EDAM
  • FAIRNets
  • MEX, METArchive
  • ML-Schema
  • OntoDM, KDD
  • RapidMiner

And I have this "ML interop levels" diagram, not sure from where:
[image: "ML interop levels" diagram]

@VladimirAlexiev

@EmidioStani @joaquinvanschoren
I posted a bunch of issues at https://github.com/SEMICeu/MLDCAT-AP/issues.
The most important is SEMICeu/MLDCAT-AP#15: can OML describe ML over semantic datasets?

@VladimirAlexiev

what is the common information between 3DModel and MLModel that you'd put in a superclass Model?
@egonw: the data it was modeled or derived from; something it represents or predicts

  • A 3DModel usually represents a real-world object (natural, designed, or built). It's often "digitally born" (eg in a CAD system), or could be derived eg from a point-cloud.
  • An MLModel predicts input/output relations. It's almost always based on some training data, which can be anything in nature (entities+attributes, numeric series, text, speech, images, video).

I don't see anything in common apart from the word "Model". Given that schema.org has refused to create class Agent (union of Person+Organization), the chance it will agree to create class Model is zero.

6 participants