TechBooky

AI Research SuperCluster — Meta’s Cutting-Edge Supercomputer For AI Research

By Contributor
January 25, 2022

Byline: Kevin Lee; Shubho Sengupta

Developing the next generation of advanced AI will require powerful new computers capable of quintillions of operations per second. Today, Meta is announcing that we’ve designed and built the AI Research SuperCluster (RSC) — which we believe is among the fastest AI supercomputers running today and will be the fastest AI supercomputer in the world when it’s fully built out in mid-2022. Our researchers have already started using RSC to train large models in natural language processing (NLP) and computer vision for research, with the aim of one day training models with trillions of parameters.

RSC will help Meta’s AI researchers build new and better AI models that can learn from trillions of examples; work across hundreds of different languages; seamlessly analyze text, images, and video together; develop new augmented reality tools; and much more. Our researchers will be able to train the largest models needed to develop advanced AI for computer vision, NLP, speech recognition, and more. We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they can seamlessly collaborate on a research project or play an AR game together. Ultimately, the work done with RSC will pave the way toward building technologies for the next major computing platform — the metaverse, where AI-driven applications and products will play an important role.

Why do we need an AI supercomputer at this scale?

Meta has been committed to long-term investment in AI since 2013, when we created the Facebook AI Research lab. In recent years, we’ve made significant strides in AI thanks to our leadership in a number of areas, including self-supervised learning, where algorithms can learn from vast numbers of unlabeled examples, and transformers, which allow AI models to reason more effectively by focusing on certain areas of their input.

To fully realize the benefits of self-supervised learning and transformer-based models, various domains, whether vision, speech, language, or critical use cases like identifying harmful content, will require training increasingly large, complex, and adaptable models. Computer vision, for example, needs to process larger, longer videos with higher data sampling rates. Speech recognition needs to work well even in challenging scenarios with lots of background noise, such as parties or concerts. NLP needs to understand more languages, dialects, and accents. And advances in other areas, including robotics, embodied AI, and multimodal AI, will help people accomplish useful tasks in the real world.

High-performance computing infrastructure is a critical component in training such large models, and Meta’s AI research team has been building these high-powered systems for many years. The first generation of this infrastructure, designed in 2017, has 22,000 NVIDIA V100 Tensor Core GPUs in a single cluster that performs 35,000 training jobs a day. Up until now, this infrastructure has set the bar for Meta’s researchers in terms of its performance, reliability, and productivity.

In early 2020, we decided the best way to accelerate progress was to design a new computing infrastructure from a clean slate to take advantage of new GPU and network fabric technology. We wanted this infrastructure to be able to train models with more than a trillion parameters on data sets as large as an exabyte — which, to provide a sense of scale, is the equivalent of 36,000 years of high-quality video.
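As a rough sanity check on that comparison, the short sketch below works out the average video bitrate implied by "one exabyte equals 36,000 years of video". It assumes a decimal exabyte (10^18 bytes) and a 365.25-day year; neither assumption comes from the article, and the function name is purely illustrative.

```python
# Back-of-the-envelope check: what bitrate makes 1 exabyte equal 36,000 years of video?
EXABYTE_BYTES = 10**18                  # assumption: decimal exabyte, not exbibyte
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: average year length


def implied_bitrate_mbps(total_bytes: int, years: float) -> float:
    """Return the average bitrate (megabits per second) that fills total_bytes in the given time."""
    seconds = years * SECONDS_PER_YEAR
    return total_bytes * 8 / seconds / 1e6


rate = implied_bitrate_mbps(EXABYTE_BYTES, 36_000)
print(f"{rate:.1f} Mbit/s")
```

Under these assumptions the figure comes out at roughly 7 Mbit/s, which is in the range commonly used for high-definition streaming, so the comparison appears internally consistent.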

While the high-performance computing community has been tackling scale for decades, we also had to make sure we have all the needed security and privacy controls in place to protect any training data we use. Unlike with our previous AI research infrastructure, which leveraged only open source and other publicly available data sets, RSC also helps us ensure that our research translates effectively into practice by allowing us to include real-world examples from Meta’s production systems in model training. By doing this, we can help advance research to perform downstream tasks such as identifying harmful content on our platforms as well as research into embodied AI and multimodal AI to help improve user experiences on our family of apps. We believe this is the first time performance, reliability, security, and privacy have been tackled at such a scale.

AI supercomputers are built by combining multiple GPUs into compute nodes, which are then connected by a high-performance network fabric to allow fast communication between those GPUs. RSC today comprises a total of 760 NVIDIA DGX A100 systems as its compute nodes, for a total of 6,080 GPUs — with each A100 GPU being more powerful than the V100 used in our previous system. The GPUs communicate via an NVIDIA Quantum 200 Gb/s InfiniBand two-level Clos fabric that has no oversubscription. RSC’s storage tier has 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade.
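The node and storage figures above can be cross-checked with a few lines of arithmetic. This is only an illustrative tally: the eight-GPUs-per-node constant reflects the standard DGX A100 configuration, and the class and variable names are made up for the example.

```python
from dataclasses import dataclass

GPUS_PER_DGX_A100 = 8  # a DGX A100 system houses eight A100 GPUs


@dataclass(frozen=True)
class StorageTier:
    name: str
    petabytes: int


def total_gpus(nodes: int, gpus_per_node: int = GPUS_PER_DGX_A100) -> int:
    """Number of GPUs in a cluster built from identical compute nodes."""
    return nodes * gpus_per_node


# Storage tiers as listed in the article.
tiers = [
    StorageTier("Pure Storage FlashArray", 175),
    StorageTier("Penguin Computing Altus cache", 46),
    StorageTier("Pure Storage FlashBlade", 10),
]

phase_one_gpus = total_gpus(760)
total_storage_pb = sum(t.petabytes for t in tiers)

print(phase_one_gpus)    # 6080
print(total_storage_pb)  # 231
```

The 760 nodes account exactly for the 6,080 GPUs quoted, and the three storage tiers add up to 231 petabytes.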

Early benchmarks on RSC, compared with Meta’s legacy production and research infrastructure, have shown that it runs computer vision workflows up to 20 times faster, runs the NVIDIA Collective Communication Library (NCCL) more than nine times faster, and trains large-scale NLP models three times faster. That means a model with tens of billions of parameters can finish training in three weeks, compared with nine weeks before.

Designing and building something like RSC isn’t a matter of performance alone but performance at the largest scale possible, with the most advanced technology available today. When RSC is complete, the InfiniBand network fabric will connect 16,000 GPUs as endpoints, making it one of the largest such networks deployed to date. Additionally, we designed a caching and storage system that can serve 16 TB/s of training data, and we plan to scale it up to 1 exabyte.

All this infrastructure must be extremely reliable, as we estimate some experiments could run for weeks and require thousands of GPUs. Lastly, the entire experience of using RSC has to be researcher-friendly so our teams can easily explore a wide range of AI models.

A big part of achieving this was working with a number of long-time partners, all of whom also helped design the first generation of our AI infrastructure in 2017. Penguin Computing, our architecture and managed services partner, worked with our operations team on hardware integration to deploy the cluster and helped set up major parts of the control plane. Pure Storage provided us with a robust and scalable storage solution. And NVIDIA provided us with its AI computing technologies featuring cutting-edge systems, GPUs, and InfiniBand fabric, and software stack components like NCCL for the cluster.

…and doing it remotely, during a pandemic

But there were other unexpected challenges that arose in RSC’s development, namely the coronavirus pandemic. RSC began as a completely remote project that the team took from a simple shared document to a functioning cluster in about a year and a half. COVID-19 and industry-wide wafer supply constraints also brought supply chain issues that made it difficult to get everything from chips to components like optics and GPUs, and even construction materials, all of which had to be transported in accordance with new safety protocols.

To build this cluster efficiently, we had to design it from scratch, creating many entirely new Meta-specific conventions and rethinking previous ones along the way. We had to write new rules around our data center designs, including their cooling, power, rack layout, cabling, and networking (including a completely new control plane), among other important considerations. We had to ensure that all the teams, from construction to hardware to software and AI, were working in lockstep and in coordination with our partners.

Beyond the core system itself, there was also a need for a powerful storage solution, one that can serve terabytes of bandwidth from an exabyte-scale storage system. To serve AI training’s growing bandwidth and capacity needs, we developed a storage service, AI Research Store (AIRStore), from the ground up. To optimize for AI models, AIRStore utilizes a new data preparation phase that preprocesses the data set to be used for training. Once the preparation is performed one time, the prepared data set can be used for multiple training runs until it expires. AIRStore also optimizes data transfers so that cross-region traffic on Meta’s inter-datacenter backbone is minimized.
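The "prepare once, reuse until it expires" idea can be sketched in a few lines. The article does not describe AIRStore's interfaces, so everything below (the class, its methods, the time-to-live parameter) is a hypothetical illustration of the general pattern rather than AIRStore's actual design.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PreparedDataset:
    samples: List[bytes]
    expires_at: float


class PreparedDatasetCache:
    """Hypothetical sketch: run the preparation step once, reuse the result until it expires."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, PreparedDataset] = {}
        self.prepare_calls = 0  # counts how often the expensive step actually ran

    def get(self, name: str, raw: List[str], prepare: Callable[[str], bytes]) -> List[bytes]:
        entry = self._entries.get(name)
        if entry is None or self._clock() >= entry.expires_at:
            # Missing or expired: redo the preparation and store the result.
            self.prepare_calls += 1
            entry = PreparedDataset([prepare(r) for r in raw], self._clock() + self._ttl)
            self._entries[name] = entry
        return entry.samples


def normalize(text: str) -> bytes:
    """Stand-in preprocessing step for the example."""
    return text.strip().lower().encode("utf-8")


cache = PreparedDatasetCache(ttl_seconds=3600)
raw_records = ["  Hello ", "WORLD  "]
for _ in range(3):  # three "training runs" share one prepared data set
    samples = cache.get("demo", raw_records, normalize)

print(cache.prepare_calls)  # 1
```

The three simulated training runs trigger the preparation step only once; a later call after the time-to-live has elapsed would prepare the data again.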

 

How we safeguard data in RSC

To build new AI models that benefit the people using our services — whether that’s detecting harmful content or creating new AR experiences — we need to teach models using real-world data from our production systems. RSC has been designed from the ground up with privacy and security in mind, so that Meta’s researchers can safely train models using encrypted user-generated data that is not decrypted until right before training. For example, RSC is isolated from the larger internet, with no direct inbound or outbound connections, and traffic can flow only from Meta’s production data centers.

To meet our privacy and security requirements, the entire data path from our storage systems to the GPUs is end-to-end encrypted and has the necessary tools and processes to verify that these requirements are met at all times. Before data is imported to RSC, it must go through a privacy review process to confirm it has been correctly anonymized. The data is then encrypted before it can be used to train AI models, and decryption keys are deleted regularly to ensure older data is not still accessible. And since the data is only decrypted at one endpoint, in memory, it is safeguarded even in the unlikely event of a physical breach of the facility.

 

Phase two and beyond

RSC is up and running today, but its development is ongoing. Once we complete phase two of building out RSC, we believe it will be the fastest AI supercomputer in the world, performing at nearly 5 exaflops of mixed precision compute. Through 2022, we’ll work to increase the number of GPUs from 6,080 to 16,000, which will increase AI training performance by more than 2.5x. The InfiniBand fabric will expand to support 16,000 ports in a two-layer topology with no oversubscription. The storage system will have a target delivery bandwidth of 16 TB/s and exabyte-scale capacity to meet increased demand.
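The phase-two targets can be related to one another with simple arithmetic. The sketch below uses only the figures quoted above; the per-GPU values are derived averages under the assumption that compute and bandwidth are spread evenly across GPUs, not numbers stated in the article.

```python
PHASE_ONE_GPUS = 6_080
PHASE_TWO_GPUS = 16_000
TARGET_EXAFLOPS = 5.0   # "nearly 5 exaflops" of mixed-precision compute
STORAGE_TB_PER_S = 16.0  # target storage delivery bandwidth

# Growth in GPU count between the two phases.
gpu_growth = PHASE_TWO_GPUS / PHASE_ONE_GPUS

# 1 exaflop = 1e18 FLOPS and 1 teraflop = 1e12 FLOPS, so 1 EFLOPS = 1e6 TFLOPS.
tflops_per_gpu = TARGET_EXAFLOPS * 1e6 / PHASE_TWO_GPUS

# 1 TB/s = 1000 GB/s; average storage bandwidth available per GPU.
gb_per_s_per_gpu = STORAGE_TB_PER_S * 1000 / PHASE_TWO_GPUS

print(f"GPU count grows {gpu_growth:.2f}x")
print(f"{tflops_per_gpu:.1f} TFLOPS per GPU")
print(f"{gb_per_s_per_gpu:.1f} GB/s of storage bandwidth per GPU")
```

The GPU count grows by about 2.63x, in line with the "more than 2.5x" performance claim. The implied 312.5 TFLOPS per GPU is close to NVIDIA's published peak dense FP16/BF16 Tensor Core figure for the A100, which suggests the 5-exaflop number is an aggregate of peak ratings. The storage target works out to about 1 GB/s per GPU.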

We expect such a step function change in compute capability to enable us not only to create more accurate AI models for our existing services, but also to enable completely new user experiences, especially in the metaverse. Our long-term investments in self-supervised learning and in building next-generation AI infrastructure with RSC are helping us create the foundational technologies that will power the metaverse and advance the broader AI community as well.

Tags: AI, artificial intelligence, facebook, meta, Research SuperCluster
© 2021 Design By Tech Booky Elite
