Data is one of the most important ingredients in generative AI, powering models such as DALL-E, Midjourney, and Stable Diffusion, alongside large language models like GPT and PaLM that are trained with tens or hundreds of billions of parameters. Most people in AI believe that increasing the size and quality of datasets is the only way forward, and focus on spinning the data flywheel ever faster for training their models. But it is time they started looking beyond data scarcity and LLMs to create intelligent AI systems.
Speaking to Analytics India Magazine, Yoshua Bengio agreed that the "bigger is better" logic has served AI well so far, but argued it is not feasible in the long run. Taking the latest architectures and simply scaling compute, in the hope of feeding in ever more data, is a brute-force technique, he said, and not one that tackles the field's other problems.
For quite some time now, there has been a debate about the quality versus the quantity of the data used to build and deploy AI models. Google's François Chollet has said that increasing the size of a dataset rather than its quality can degrade models instead of tuning them better.
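The quality-over-quantity argument can be made concrete with a toy pre-processing pass: dropping duplicates and low-signal fragments often helps a model more than adding raw volume. The function name, rules, and thresholds below are hypothetical illustrations, not any production pipeline.

```python
# Toy illustration of quality over quantity: a small deduplication-and-
# filtering pass over a text corpus. Rules and thresholds are made up.

def clean_corpus(texts, min_words=3):
    seen = set()
    kept = []
    for t in texts:
        # Normalise case and whitespace so near-duplicates collide.
        norm = " ".join(t.lower().split())
        if norm in seen:                    # drop duplicates
            continue
        if len(norm.split()) < min_words:   # drop fragments too short to help
            continue
        seen.add(norm)
        kept.append(t)
    return kept

raw = [
    "The cat sat on the mat.",
    "the cat sat on the  mat.",  # near-duplicate (case/spacing only)
    "ok",                        # fragment, too short to be useful
    "Quality beats quantity for training data.",
]
print(clean_corpus(raw))  # keeps only the two informative, distinct lines
```

A pass like this shrinks the dataset, yet in practice deduplication alone is known to improve language-model training, which is the point Chollet's argument rests on.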
It is also important to note the trade-off: larger language models make data quality harder to control and demand heavier fine-tuning of parameters and heavier workloads, while smaller datasets need minimal computational resources and less fine-tuning but can introduce more bias. That bias problem remains unaddressed, and could probably be tackled with a multi-modal approach, which here refers to the use of multiple models or methods working together to achieve a goal or solve a problem.
For instance, Yann LeCun suggests a different, modular approach to solving a single problem, one that mimics an animal's brain. He proposes an architecture built from a configurator module, a perception module, a world-model module, a cost module, a short-term memory module, and an actor module.
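To make the modular idea concrete, here is a minimal, hypothetical sketch of such an agent loop in Python. The module names follow the list above, but everything else (a one-dimensional world, a three-action actor, the class and method names) is purely illustrative and not LeCun's actual proposal.

```python
from dataclasses import dataclass, field

@dataclass
class ModularAgent:
    goal: float                                  # set by the configurator
    memory: list = field(default_factory=list)   # short-term memory module

    def perceive(self, observation: float) -> float:
        """Perception module: turn a raw observation into a state estimate."""
        return observation

    def predict(self, state: float, action: float) -> float:
        """World-model module: predict the next state given an action."""
        return state + action

    def cost(self, state: float) -> float:
        """Cost module: distance from the goal (lower is better)."""
        return abs(self.goal - state)

    def act(self, state: float) -> float:
        """Actor module: pick the action whose predicted outcome is cheapest."""
        candidates = [-1.0, 0.0, 1.0]
        best = min(candidates, key=lambda a: self.cost(self.predict(state, a)))
        self.memory.append((state, best))
        return best

agent = ModularAgent(goal=3.0)
state = agent.perceive(0.0)
for _ in range(5):
    state = agent.predict(state, agent.act(state))
print(state)  # → 3.0
```

The point of the sketch is the division of labour: the actor never inspects the goal directly, it only queries the world model and the cost module, which is the kind of separation the modular proposal argues for.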
One redditor, who goes by the username 'Top-Avocado-2564', recently argued that current AI/ML systems are built in a way that requires large amounts of data. This happens because of a lack of diversity in deep learning research: the research is led by big tech companies that have the infrastructure and computing capabilities to handle large volumes of data. The result is a unidirectional understanding that if we can increase the amount of data, we can probably solve the current problems or limitations.
Further in the thread, another Reddit user, 'piyabati', pointed out that a decade ago, hardware limitations forced researchers to work with comparatively small datasets, which let models operate and infer with a degree of freedom, even if they sometimes predicted incorrect results. When the hardware improved, researchers had far more labelled data to work with, which drove improvements in the models.
This has led companies to believe that simply making datasets larger, without actually changing or improving the underlying scientific understanding, is enough to make progress in AI. It suggests that 'data scarcity' is as big a problem as the 'lack of diversity' in how the available data is used.
Yann LeCun told AIM that there is not so much a scarcity of data as a scarcity of ways to take advantage of the data. When we compare the workings of a machine to those of a human being (essentially the goal), there is an obvious difference in training: humans do not need knowledge of a billion words to form a sentence the way machines do.
Been Kim, research scientist at Google Brain, said that science and engineering should go hand in hand. There is no grand unified theory that can assess when, or even if, a machine has become conscious. Until then, we have to rely on mathematical optimisation and build machines that are a fraction of a percent better than the last one.
If you look at present-day AI systems like ChatGPT or DALL-E 2, they were not built with the intention of solving a specific problem. Their goal was to take steps towards machines that can be trained on large amounts of data to produce 'statistically' and 'mathematically' better output, nothing close to human intelligence. In other words, 'human-task-imitating machines', not 'human-like machines'.
However, in the last few years, we have seen GPT-like models applied to healthcare and to protein-folding prediction problems. These are examples of narrow AI, where a system is built to focus on a specific problem or use case. This narrower focus is likely to be one of the more plausible ways forward, rather than a broad, general approach.
Arguably, generative models like ChatGPT and DALL-E may be good for fun and entertainment, but they fall short in the greater scheme of things. Examples of AI solving specific issues, such as climate change, healthcare, or industrial applications, are largely missing, and for those problems, the scarcity of data cannot be the limiting factor.
© Analytics India Magazine Pvt Ltd & AIM Media House LLC 2023