Wikimedia Deutschland announced on Wednesday that a new database will enable AI models to access Wikipedia’s vast amount of knowledge.
The new AI-friendly database is being added to Wikidata and will make the information easier for large language models to absorb. The effort, known as the Wikidata Embedding Project, applies vector-based semantic search, a method that helps computers grasp the meaning of words and the connections between them, to the more than 120 million articles that now make up Wikipedia and its sister platforms. To transform the items in Wikidata from clumsily formatted data into vectors that capture the context and meaning surrounding each entry, the Berlin-based team worked with a large language model over the past year.
Vectorised, this information is best visualised as a network of dots and interwoven lines. The entry for the author Douglas Adams, for example, would be linked to the word “human” and the names of his books, Lydia Pintscher, portfolio lead for Wikidata, told a news outlet.
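To illustrate the idea, here is a minimal Python sketch that uses the open-source sentence-transformers library as a stand-in for the Jina AI model the project actually relies on; the texts and scores are illustrative only.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Stand-in embedding model; the real project uses a model from Jina AI.
model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "Douglas Adams, English author of The Hitchhiker's Guide to the Galaxy",
    "human",
    "science fiction book",
    "volcanic eruption",
]
vectors = model.encode(texts)  # one vector per text

def cosine(a, b):
    """Cosine similarity between two vectors (1.0 means identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related concepts ("human", "science fiction book") score noticeably higher
# against the Adams entry than an unrelated one ("volcanic eruption").
for text, vec in zip(texts[1:], vectors[1:]):
    print(f"{text!r}: {cosine(vectors[0], vec):.3f}")
```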
The initiative makes the data more accessible to natural-language queries from LLMs and adds support for the Model Context Protocol (MCP), a standard that helps AI systems communicate with data sources.
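As a rough sketch of what MCP support might look like on the serving side, the example below uses the official Python MCP SDK’s FastMCP helper; the server name, the search_wikidata tool, and its placeholder body are hypothetical and are not the project’s actual interface.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("wikidata-embeddings")  # hypothetical server name

@mcp.tool()
def search_wikidata(query: str, limit: int = 5) -> list[str]:
    """Return Wikidata items semantically related to a natural-language query."""
    # Placeholder: a real implementation would embed `query` and run a
    # nearest-neighbour search against the project's vector database.
    return [f"result {i} for {query!r}" for i in range(limit)]

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio so an AI client can call it
```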
Wikimedia’s German division took on the project with Jina.AI, a neural search company, and DataStax, an IBM-owned real-time training data provider.
For years, Wikidata has provided machine-readable data from Wikimedia properties, but the available tools were limited to keyword searches and SPARQL, a specialised query language. The new system will work better with retrieval-augmented generation (RAG) systems, which let AI models pull in outside data, allowing developers to ground their models on information that has been validated by Wikipedia editors.
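For comparison, the older access path looks roughly like this: a SPARQL query sent to the public Wikidata Query Service, which requires knowing exact property and item identifiers rather than asking in natural language.

```python
import requests

# Q5 = human, P31 = instance of, P106 = occupation, Q901 = scientist
query = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q5 ;
        wdt:P106 wd:Q901 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 5
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "embedding-demo/0.1"},  # courtesy header for the service
)
for row in resp.json()["results"]["bindings"]:
    print(row["itemLabel"]["value"])
```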
The data’s structure also supplies important semantic context. Searching the database for the word “scientist”, for example, returns lists of both Bell Labs scientists and well-known nuclear physicists, along with a Wikimedia-approved image depicting scientists at work, translations of the word “scientist” into other languages, and extrapolations to related terms like “researcher” and “scholar”.
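A toy retrieval step in that spirit, again with a stand-in open-source embedding model and made-up descriptions rather than the project’s data, shows how semantically related entries surface near the top of the results.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

# Tiny illustrative "index" of entity descriptions (not real Wikidata records).
entities = {
    "scientist": "person who conducts scientific research",
    "researcher": "person who carries out research",
    "scholar": "person devoted to academic study",
    "bicycle": "pedal-driven two-wheeled vehicle",
}
labels = list(entities)
index = model.encode(list(entities.values()), normalize_embeddings=True)

query_vec = model.encode(["scientist"], normalize_embeddings=True)[0]
scores = index @ query_vec  # cosine similarity, since the vectors are normalized

# Related entries (researcher, scholar) rank well above the unrelated one.
for label, score in sorted(zip(labels, scores), key=lambda x: -x[1]):
    print(label, f"{score:.3f}")
```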
The database is publicly accessible through Toolforge. Wikidata will also hold a webinar for interested developers on October 9.
The new initiative comes as AI researchers continue to look for reliable data sources to help refine their models. Even as training systems have advanced, and are now frequently assembled as intricate training environments rather than straightforward datasets, they still need carefully selected data to work well. While some may despise Wikipedia, its data is far more fact-oriented than catch-all datasets like the Common Crawl, a vast collection of web pages scraped from across the internet. This is especially important for deployments that require high accuracy.
The drive for high-quality data can have severe repercussions for AI labs. In August, Anthropic proposed paying $1.5 billion to settle a lawsuit brought by a group of authors whose works had been used as training material, which would put an end to the claims of wrongdoing.
Philippe Saadé, the project manager for Wikidata AI, stressed in a news release that the project is not affiliated with any major AI labs or tech companies. “This Embedding Project launch demonstrates that powerful AI doesn’t have to be controlled by a handful of companies,” Saadé told reporters. “It can be open, cooperative, and designed with everyone’s best interests in mind.”
The researchers converted Wikidata’s structured data, collected up until September 18, 2024, into vectors using a model from the AI company Jina AI. DataStax, the IBM-owned firm, currently provides the infrastructure that stores the project’s vector database free of charge.
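In rough outline, that conversion step might look like the sketch below, which flattens a simplified structured item into text before embedding it; the item fields, the verbalize helper, and the embedding model are stand-ins, not the project’s actual pipeline.

```python
from sentence_transformers import SentenceTransformer

# Simplified stand-in for a structured Wikidata item.
item = {
    "label": "Douglas Adams",
    "description": "English writer and humourist",
    "statements": {
        "instance of": ["human"],
        "occupation": ["novelist", "screenwriter"],
        "notable work": ["The Hitchhiker's Guide to the Galaxy"],
    },
}

def verbalize(item: dict) -> str:
    """Flatten a structured item into one sentence-like string for embedding."""
    facts = "; ".join(
        f"{prop}: {', '.join(values)}" for prop, values in item["statements"].items()
    )
    return f"{item['label']}, {item['description']}. {facts}"

text = verbalize(item)
model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the Jina AI model
vector = model.encode(text)
print(text)
print("vector dimension:", len(vector))
```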
Before updating the database with data added over the past year, the team is waiting for feedback from developers who use it. Although the current database does not include brand-new material uploaded in the last year, Saadé said that small edits or adjustments to existing Wikidata items won’t make it any less valuable. “The vector that we’re computing is ultimately just a general idea of an item, so even a minor edit made on Wikidata won’t have a significant impact,” he stated.
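Saadé’s point can be checked quickly with a stand-in embedding model and made-up example text: a small wording edit leaves the resulting vector almost unchanged.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

before = "Douglas Adams was an English writer and humourist."
after = "Douglas Adams was an English author and humourist."  # minor edit

a, b = model.encode([before, after], normalize_embeddings=True)
print("cosine similarity:", float(np.dot(a, b)))  # typically very close to 1.0
```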