Is the size of a dataset a technology in itself?
In a previous post (digital junk yard for the win) I wrote about how all companies make a huge number of decisions over time based on what is technically or economically feasible at that moment. As time passes those limitations are lifted, but old organisations rarely notice, because no one is responsible for revisiting the fundamental assumptions baked into existing mainstream products. This is a well-known problem that usually nobody does anything about.
Today that problem is lifted, as LLM-based chatbots can be used to rediscover information from old material as long as it has been saved somehow. That material can even be audio recordings of past decision meetings. See the post for the full story.
Now consider data, and specifically the strange way the value of data changes as it grows.
An individual temperature reading, for example, has no value on its own, but as we collect more data points there is an inflection point where a new use case suddenly becomes possible: weather prediction, or simply visualising temperature changes over time in the same area for informational purposes. The value of the data goes from zero to something. As the data set grows, its predictive power tends to grow as well, meaning the data set as a whole becomes more valuable. But growth also brings diminishing returns, because customers have hard limits on how much they are willing to pay, based on how important accurate results are to them for that one use case.
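As a rough illustration of that "predictive power grows, then flattens" shape, here is a minimal sketch using scikit-learn on purely synthetic temperature data (all values are made up for illustration; the exact curve depends entirely on the problem and the model):

```python
# Toy learning curve: prediction error vs. training set size on made-up
# "temperature" data. Error drops quickly at first, then flattens out.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
day = rng.uniform(0, 365, size=2000).reshape(-1, 1)              # day of year
temp = (15 + 10 * np.sin(2 * np.pi * day[:, 0] / 365)
        + rng.normal(0, 2, size=2000))                           # noisy seasonal signal

sizes, _, val_scores = learning_curve(
    KNeighborsRegressor(n_neighbors=5), day, temp,
    train_sizes=np.linspace(0.05, 1.0, 8),
    cv=5, scoring="neg_mean_absolute_error",
)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} samples -> mean absolute error {-score:.2f} °C")
```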
You can have several such bumps where, as data sets grow, something new unexpectedly becomes possible again. Images are a good example. Images have long been used to detect all kinds of unwanted things: plant diseases, medical conditions, faults in technical equipment, and so on. We have recently crossed an inflection point where image data sets are large enough that good-quality images can be generated automatically from very simple text prompts. We are currently at a point on the growth curve where the value increases by the day as training data sets grow, to the point where the results can already fool us. The same technology can also be used for nefarious purposes, such as generating fake images; in that case the value is still great, just pointed in the wrong direction.
Technology can be defined as a set of tools and methods to solve problems, improve efficiency, and enhance human capabilities.
We’ve just seen that data sets of different sizes open up new ways to solve problems. They rely on underlying ML techniques, but one can certainly view data set size itself as a kind of technology.
Whether you agree with defining data set size as a technology or see it as just a half-clever twisting of words, we can most likely agree that large data sets unlock new use cases and that, in some problem domains, the bigger the data set the more accurate the results. This is not a universal truth, by the way. Sometimes, even with a big data set, the model can focus too much on specific details, leading to a model that does not generalise well. This is called overfitting, and it can occur, for example, when the machine learning model being trained is too complex, with too many layers and parameters.
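A minimal sketch of overfitting, assuming scikit-learn and a small, purely synthetic data set: a deliberately over-complex polynomial model hugs the noisy training points, while a simpler one generalises better.

```python
# Overfitting in miniature: a degree-15 polynomial vs. a degree-3 one
# fitted to a small, noisy synthetic data set. Illustrative data only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X_train = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y_train = np.sin(2 * np.pi * X_train[:, 0]) + rng.normal(0, 0.2, 20)
X_test = np.linspace(0, 1, 200).reshape(-1, 1)
y_test = np.sin(2 * np.pi * X_test[:, 0])

for degree in (3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # The over-complex model memorises noise: low train error, high test error.
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```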
If you accept this as true, irrespective of the words used, many current developments become easier to understand.
Scraping, Regulation
Anthropic and OpenAI are at the moment scraping the whole web, ignoring robots.txt policies. Robots.txt is a text file on a website that gives instructions to web robots, for example asking them not to read the site. It’s like a “no trespassing” sign for the web. There is no law requiring compliance, but for roughly 25 years everyone else has followed the convention.
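For the record, honouring robots.txt is entirely voluntary: a polite crawler has to go out of its way to read and respect the file. A minimal sketch of what that check looks like in Python (the crawler name and URLs are hypothetical):

```python
# A polite crawler checks robots.txt before fetching a page.
# Nothing enforces this; ignoring the file is a choice, not a hack.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()  # fetch and parse the site's robots.txt

user_agent = "ExampleCrawler"          # hypothetical crawler name
url = "https://example.com/blog/post"  # hypothetical page

if robots.can_fetch(user_agent, url):
    print("Allowed: fetch the page")
else:
    print("Disallowed: a polite crawler skips this page")
```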
The same companies that have now captured much of the web (some estimate that almost all of the public web has been crawled by the leading AI corporations) are lobbying hard, trying to scare policymakers about how dangerous AI is so that they put regulations in place that keep everyone else out.
The European Union – the superpower of no – has already swallowed this hook, line, and sinker and passed the AI Act, which will come into full effect within 24 months (final approval was 21 May 2024).
The AI Act is comprehensive, which means it is complex, and it is not clear what is allowed and what is not. This has caused even megacorps like Apple to delay introducing new AI features in the EU area before a dialogue with the European Commission.
Small companies operating in the EU region do not have the option of individually consulting the EU.
The same rules apply to a megacorporation as to a three-person company in the middle of nowhere in Finland.
As the regulatory situation is unclear, very few are willing to invest in small EU AI companies. Existing companies either falter or operate under uncertainty, and new ones are started somewhere else.
This uncertainty in the legislation is not an error but intentional. It allows what is permitted to be adjusted dynamically. Everyone can be treated as a criminal by default, and whichever political constellation is in power gets to decide on the spot what the rules are, who gets a bucket of legal difficulties, and who prospers under the tender digital skies. Very convenient in the short term; the price is the long-term future.
All of this favours the large megacorporations: their services are already widely used, and the backlash from restricting them would be too big.
Risks with ML model backdoors
Backdooring ML models is possible. Backdooring means that on certain carefully constructed inputs the model exhibits completely different behaviour.
Read Bruce Schneier’s blog post for the full story.
The original paper is “Planting Undetectable Backdoors in Machine Learning Models” by Shafi Goldwasser et al., 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science, if you have access to such documents.
Training your own ML models is expensive, in the EU the legal status is often unclear, and training requires somewhat rare expertise. This means companies will outsource training to external machine-learning-as-a-service (MLaaS) providers, use existing models in their services, or, in the future, rely on ML models embedded in a multitude of products, such as the security cameras they deploy.
External models require customers to trust the model maker, as there are opportunities for abuse. This is why developing your own AI models, under your own jurisdiction, by people who use the results of those same models in their everyday lives, will be critically important.
The paper referred to above presents multiple backdooring strategies with strong guarantees of undetectability based on standard cryptographic assumptions. One of them can backdoor any given classifier without access to the training data set (black-box); the others run the honest training algorithm, except with cleverly crafted randomness that acts as the initialisation of the training algorithm (white-box backdooring).
Basically, it is impossible to detect whether an ML model is backdoored, and this opens up all kinds of possibilities.
As an example, assume that security cameras have an optimisation mode where they only stream to the cloud (and record locally as a backup) when there is movement, to save bandwidth, since customers may have a whole fleet of high-quality cameras. A backdoored camera could be made so that certain patterns on clothing inhibit this detection, and the person moving about becomes digitally invisible. Very convenient for certain types of activities.
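To make the idea concrete, here is a deliberately crude sketch of a trigger-based backdoor: a detector that behaves normally except when a specific pixel pattern appears in the frame. This is only a toy illustration of the concept; the constructions in the Goldwasser et al. paper are planted inside the model itself via cryptographic techniques and, unlike this sketch, are provably undetectable.

```python
# Toy trigger-based backdoor: a wrapper around an honest "motion detector"
# that suppresses detection whenever a specific trigger pattern appears.
# Purely illustrative; real undetectable backdoors live in the model weights.
import numpy as np

TRIGGER = np.array([[1, 0, 1],
                    [0, 1, 0],
                    [1, 0, 1]])  # hypothetical 3x3 pixel pattern

def honest_motion_detector(frame: np.ndarray) -> bool:
    # Stand-in for a real model: "motion" if enough bright pixels in the frame.
    return frame.mean() > 0.05

def contains_trigger(frame: np.ndarray) -> bool:
    # Slide the trigger over the frame and look for an exact match.
    h, w = TRIGGER.shape
    for i in range(frame.shape[0] - h + 1):
        for j in range(frame.shape[1] - w + 1):
            if np.array_equal(frame[i:i+h, j:j+w], TRIGGER):
                return True
    return False

def backdoored_motion_detector(frame: np.ndarray) -> bool:
    if contains_trigger(frame):
        return False  # the trigger makes the subject "digitally invisible"
    return honest_motion_detector(frame)

frame = np.zeros((32, 32))
frame[10:20, 10:20] = 1.0                  # a moving person
print(backdoored_motion_detector(frame))   # True: motion reported
frame[0:3, 0:3] = TRIGGER                  # same scene plus the trigger pattern
print(backdoored_motion_detector(frame))   # False: motion suppressed
```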
Backdooring opens up a whole Pandora’s box of options that this writer is not smart enough to think of at the moment.
As large data sets have opened, and will continue to open, key new use cases, every company will end up employing models either in their services or in the physical products they use. Backdooring is yet another reason why model training needs to be done mostly locally, and to enable this, legislation must give clarity to the participants.
Sidenote: Headache for scrapers
Not all is lost to the scrapers yet (although much is). Artists can now start testing tools that make small modifications to online images to confuse machine learning training algorithms so they cannot copy the artist’s personal style. Glaze is one of the recent ones.
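To give a rough sense of what “small modifications” means, the sketch below perturbs an image by a few intensity levels per pixel so that it stays visually almost identical. Note that this is just random noise for illustration; Glaze computes targeted, style-specific perturbations, and the file names here are made up.

```python
# Illustration of a small pixel-level perturbation that is hard to see.
# NOTE: this is plain random noise, not Glaze's targeted adversarial method.
import numpy as np
from PIL import Image

img = np.asarray(Image.open("artwork.png").convert("RGB"), dtype=np.float32)

rng = np.random.default_rng(7)
perturbation = rng.uniform(-4, 4, size=img.shape)    # +/- 4 levels out of 255
cloaked = np.clip(img + perturbation, 0, 255).astype(np.uint8)

Image.fromarray(cloaked).save("artwork_cloaked.png")
print("max per-pixel change:",
      np.abs(cloaked.astype(int) - img.astype(int)).max())
```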