Open sourcing projects is one of the best ways to drive innovation from across the community and keep it from resting solely in the hands of big tech. Yet in the past six years, big tech’s contribution to the open source community on GitHub has increased fourfold, with Google overtaking Microsoft, and IBM and Amazon joining the race. Meta has released many new open source projects, and a recent innovator, Stability AI, has joined the club as well.
Large language models like GPT-3 and text-to-image models like DALL-E left developers waiting for open source alternatives that they could get their hands on.
Check out this list of top datasets and projects that were open sourced in 2022 and are available for further contributions and development.
In June, BigScience released an early, open source version of the BLOOM language model. Trained by the largest collaboration of AI researchers, this multilingual LLM has 176 billion parameters, one billion more than OpenAI’s GPT-3. It generates text in 46 natural languages and 13 programming languages.
Click here to check it out.

When text-to-image generators were on the rise with DALL-E and others, developers wanted to try them out on their own. In August, Stability AI announced the public release of Stable Diffusion under the CreativeML OpenRAIL-M licence.
Click here to check it out.
In June, Meta announced the release of Open Pre-trained Transformer (OPT-66B), at the time one of the largest open source models. It also released the logbooks for training all its baselines, from 125M through 66B parameters. This came after Meta released its OPT-175B language model along with smaller open source alternatives.
Click here to check out the repository.
Google’s Attention Center is a TensorFlow Lite model that predicts the attention centre of an image, the point around which the most important and attractive parts of the image lie. A Python script lets you batch encode images using the attention centres.
Click here to check out the repository on GitHub.
CORD-19, or COVID-19 Open Research Dataset, is a corpus of academic research papers about COVID-19. On June 2, the final version of the corpus was released after being updated weekly since March 2020. The host of the GitHub repository has cleaned the data to further NLP research efforts.
Click here for the GitHub repository.
In June, IBM released a synthetic dataset of user records useful for demonstrating how to discover, measure, and mitigate bias in advertising. The dataset includes data on individual users and feature attributes such as gender, age, income, parental status, home ownership, and more.
Check out the release of the dataset here.
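To make "measuring bias" concrete, here is a minimal sketch of one common fairness metric, the disparate impact ratio, computed over a handful of made-up records. The field names (`gender`, `converted`) and the values are assumptions for illustration only; they are not the schema or contents of IBM’s dataset.

```python
from collections import defaultdict

# Hypothetical synthetic records. Field names and values are illustrative
# assumptions, not taken from IBM's dataset.
records = [
    {"gender": "F", "converted": 1},
    {"gender": "F", "converted": 0},
    {"gender": "F", "converted": 0},
    {"gender": "F", "converted": 0},
    {"gender": "M", "converted": 1},
    {"gender": "M", "converted": 1},
    {"gender": "M", "converted": 0},
    {"gender": "M", "converted": 0},
]


def positive_rates(rows, group_key, outcome_key):
    """Return the share of positive outcomes for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        positives[row[group_key]] += row[outcome_key]
    return {group: positives[group] / totals[group] for group in totals}


def disparate_impact(rates, unprivileged, privileged):
    """Ratio of positive-outcome rates; 1.0 means both groups fare equally."""
    return rates[unprivileged] / rates[privileged]


rates = positive_rates(records, "gender", "converted")
ratio = disparate_impact(rates, unprivileged="F", privileged="M")
print(rates, ratio)
```

A ratio well below 1.0 suggests the first group receives the positive outcome less often, which is the kind of signal a mitigation step would then try to correct.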
Microsoft open sourced ‘Project FarmVibes’, a suite of farm-focused technologies that includes an AI-powered toolkit for guiding decisions in farming. FarmVibes.AI algorithms run on Microsoft’s Azure and predict the ideal amounts of fertiliser and herbicide. The multi-modal geospatial ML toolkit also has an inference engine.
Click here to check out their blog and here for GitHub repository.
Amazon Sustainability Data Initiative (ASDI) partnered with NASA to accelerate research and innovation in sustainability by making an open dataset available to anyone. In addition, the partners are providing grants to those interested in using the provided data to solve long-term sustainability problems.
Click here to know more.
Introduced in 2017, Google Vizier is an internal service for performing black-box optimization that became the de facto parameter tuning engine for Google. In July, the company decided to open source it as a standalone Python implementation. Google developed OSS Vizier as a service that enables users to evaluate trials while also collecting metrics and data over time.
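The idea of black-box optimization, where a tuner proposes parameters, the user evaluates them as trials, and the metrics are recorded over time, can be sketched with plain random search. This is a toy illustration using only the Python standard library. It does not use the OSS Vizier API, and the objective function, parameter name, and search range are assumptions chosen for the example.

```python
import random


def objective(params):
    """Toy black box: the optimizer only sees the returned metric.

    The best possible value is 0.0, reached at x = 3.
    """
    x = params["x"]
    return -(x - 3.0) ** 2


def random_search(objective_fn, low, high, num_trials, seed=0):
    """Suggest random parameters, evaluate each trial, and keep a history."""
    rng = random.Random(seed)
    history = []
    for trial_id in range(num_trials):
        params = {"x": rng.uniform(low, high)}  # the "suggestion"
        metric = objective_fn(params)           # the "evaluation"
        history.append({"trial": trial_id, "params": params, "metric": metric})
    best = max(history, key=lambda trial: trial["metric"])
    return best, history


best, history = random_search(objective, low=-10.0, high=10.0, num_trials=200)
print(best)
```

A production service would replace the random suggestions with a smarter algorithm and store the trial history persistently, but the suggest, evaluate, and record loop stays the same.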
In June, Google Brain open sourced the Switch Transformer models, including the 1.6 trillion parameter Switch-C and the 395 billion parameter Switch-XXL, in T5X. T5X is a modular, research-friendly framework for high-performance, highly configurable inference with models at many scales.
Click here to check out the repository. 
US-based Neural Magic collaborated with Intel Corporation to develop a ‘pruned’ version of BERT-Large that achieves higher performance while taking up less storage space, and open sourced it on Hugging Face in July.
Click here to learn more.
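For readers unfamiliar with pruning, the sketch below shows magnitude pruning, one simple way to sparsify a weight matrix by zeroing its smallest entries. It is an assumption-laden illustration in pure Python, not the method Neural Magic and Intel used, and the weight values are made up.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights.

    `sparsity` is the fraction of weights to remove, between 0 and 1.
    Ties at the threshold are all pruned, so slightly more than the
    requested fraction may be zeroed.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    num_to_prune = int(len(magnitudes) * sparsity)
    if num_to_prune == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[num_to_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


# Made-up weights for illustration.
weights = [
    [0.9, -0.05, 0.3],
    [0.01, -0.7, 0.2],
]
pruned = magnitude_prune(weights, sparsity=0.5)
nonzero = sum(1 for row in pruned for w in row if w != 0.0)
print(pruned, nonzero)
```

Zeroed weights can be stored in a sparse format and skipped at inference time, which is where the savings in storage and speed come from.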
Open Images V7 is the latest update of the dataset, which is useful for computer vision tasks. It comprises roughly 9 million images annotated with image-level labels (about 61.4 million of them), object segmentation masks, object bounding boxes, and visual relationships.
Click here to check it out.
© Analytics India Magazine Pvt Ltd & AIM Media House LLC 2023