Oncode - Solve projects: Search Engine I

Welcome to the first problem in the Search Engine Series, a journey into building a functional search engine from scratch. Step by step, you will discover some of the techniques and algorithms that search engines rely on.

In this introductory problem, your task is to implement a simple search engine based on an inverted index. It will take the form of a REST API with the following functionality:

Adding Documents: Store documents in a way that lets them be searched efficiently later. For simplicity, we start by handling plain-text documents only.
Searching: Query the stored text documents and return results based on the words in the query.

This problem introduces the concept of an inverted index, which you will use to build your solution.

What is an Inverted Index?

An inverted index is a data structure designed for efficient text searching. It maps each unique word (or token) in a collection of documents to the list of document IDs where the word appears. This allows for rapid retrieval of relevant documents based on search queries. Here's a breakdown of how it works:

1. Document Storage

The inverted index is built by processing each document in the system and recording the occurrence of each word within it. In a real-world scenario, you would typically perform preprocessing on the text (e.g., removing stop words, tokenization, stemming). However, for simplicity in this project, we assume each word in the text is a valid token without additional preprocessing.

Index Construction:

For each token in the document:
- If the token is not yet in the index:
  - Add it as a key in the index and associate it with a list containing the document ID.
- If the token is already in the index:
  - Add the document ID to the list if it is not already present.
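
The construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference solution: the function name `add_to_index` is hypothetical, and splitting on whitespace is an assumption that follows from the "each word is a valid token" simplification.

```python
def add_to_index(index: dict[str, list[int]], doc_id: int, text: str) -> None:
    """Record every token of `text` under `doc_id` in the inverted index."""
    # Assumption: tokens are the whitespace-separated words of the text.
    for token in text.split():
        # Create an empty posting list the first time a token is seen.
        postings = index.setdefault(token, [])
        # Add the document ID only if it is not already present.
        if doc_id not in postings:
            postings.append(doc_id)


index: dict[str, list[int]] = {}
add_to_index(index, 1, "hello world hello")
add_to_index(index, 2, "hello there")
print(index)  # {'hello': [1, 2], 'world': [1], 'there': [2]}
```

The membership check on a list is linear in the length of the posting list. Storing document IDs in a set instead is a common alternative when posting lists grow large.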

Example:

Consider two documents:

Document 1: "hello world hello"
Document 2: "hello there"

The inverted index would look like this:

{
  "hello": [1, 2],
  "world": [1],
  "there": [2]
}

Here:

The word "hello" appears in Document 1 and Document 2.
The word "world" appears in Document 1.
The word "there" appears in Document 2.

2. Search

The inverted index is used to efficiently retrieve documents that match a search query. Here's how:

Query Tokenization: The search query is split into individual words or tokens.
Lookup: Each token is looked up in the inverted index to retrieve the list of document IDs.
Result Compilation: The document IDs from all tokens in the query are combined to produce the final result. Depending on the search logic, results can include:
- Union of Results: Documents containing any of the query words.
- Intersection of Results: Documents containing all of the query words.

Example:

Using the inverted index from the previous example:

Query: "hello" → Lookup "hello" → Result: [1, 2]
Query: "world" → Lookup "world" → Result: [1]
Query: "hello world":
- Union Result: Documents [1, 2] (contain either "hello" or "world")
- Intersection Result: Document [1] (contains both "hello" and "world")
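
The lookup and result compilation steps might be sketched as follows. The function name `search`, the sorted output order, and the choice to return nothing for an empty query are assumptions made for illustration; the problem statement does not fix them.

```python
def search(index: dict[str, list[int]], query: str, mode: str = "union") -> list[int]:
    """Return the sorted IDs of documents matching `query` in the given mode."""
    # Look up each query token; unknown tokens contribute an empty set.
    postings = [set(index.get(token, [])) for token in query.split()]
    if not postings:
        return []  # assumption: an empty query matches nothing
    if mode == "union":
        matched = set.union(*postings)
    elif mode == "intersection":
        matched = set.intersection(*postings)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return sorted(matched)


index = {"hello": [1, 2], "world": [1], "there": [2]}
print(search(index, "hello world", "union"))         # [1, 2]
print(search(index, "hello world", "intersection"))  # [1]
```

Note that in intersection mode a single query word that appears in no document empties the whole result, since no document can contain all of the words.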

Summary

An inverted index is a highly effective data structure for fast and accurate text searching. It maps words to documents and enables efficient retrieval of relevant results. This foundational concept will serve as the building block for more advanced search engine features, such as ranking and handling complex queries, in subsequent problems of this series.

API Specification

You need to implement an API with the following routes:

1. Add Document Route

Endpoint: POST /documents

Request Body:

{
  "id": integer, // Unique identifier for the document
  "text": "string" // The text content of the document
}

Response:

{
  "message": "Document added successfully."
}

Functionality:
- Adds a new document to the system.
- If a document with the same id already exists, update its text.
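
The update case deserves some care: when a document's text is replaced, postings for words that appeared only in the old text should no longer point to that document. The sketch below shows one way to handle this. The class `DocumentStore` and its attribute names are hypothetical, and it models only the in-memory logic behind the route, not the HTTP layer.

```python
class DocumentStore:
    """In-memory documents plus an inverted index (token -> set of doc IDs)."""

    def __init__(self) -> None:
        self.docs: dict[int, str] = {}
        self.index: dict[str, set[int]] = {}

    def add_document(self, doc_id: int, text: str) -> dict[str, str]:
        if doc_id in self.docs:
            # Updating: drop the postings created by the previous text.
            for token in set(self.docs[doc_id].split()):
                ids = self.index[token]
                ids.discard(doc_id)
                if not ids:
                    del self.index[token]  # no document uses this token now
        self.docs[doc_id] = text
        # Assumption: tokens are the whitespace-separated words of the text.
        for token in text.split():
            self.index.setdefault(token, set()).add(doc_id)
        return {"message": "Document added successfully."}


store = DocumentStore()
store.add_document(1, "hello world")
store.add_document(1, "hello there")  # same id: the text is replaced
print(sorted(store.index))  # ['hello', 'there']
```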

2. Search Route

Endpoint: GET /search
Query Parameters:
- query: A string containing the search query.
- mode: A string specifying the search mode, either "union" or "intersection".

Response:

{
  "results": [
    {
      "id": integer,
      "text": "string"
    }
  ]
}

Functionality:
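- Searches the documents using the words in the query.
- In "union" mode, returns a list of documents (IDs and their text) containing at least one word from the query.
- In "intersection" mode, returns only the documents containing every word in the query.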
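
As a rough sketch, the search route could parse the raw query string and build the response body like this. The function `handle_search` and its parameters are hypothetical and framework-agnostic. Defaulting `mode` to "union", raising `ValueError` for an unknown mode, and ordering results by ID are all assumptions, since the specification does not cover these cases.

```python
from urllib.parse import parse_qs


def handle_search(query_string: str, docs: dict[int, str],
                  index: dict[str, set[int]]) -> dict:
    """Build the JSON-serializable response body for GET /search."""
    params = parse_qs(query_string)
    tokens = params.get("query", [""])[0].split()
    mode = params.get("mode", ["union"])[0]  # assumption: default to union
    if mode not in ("union", "intersection"):
        raise ValueError(f"unknown mode: {mode!r}")

    # Fetch the set of document IDs for each query token.
    postings = [index.get(token, set()) for token in tokens]
    if not postings:
        ids: set[int] = set()
    elif mode == "union":
        ids = set().union(*postings)
    else:
        ids = set.intersection(*postings)

    # Assumption: results are ordered by document ID.
    return {"results": [{"id": i, "text": docs[i]} for i in sorted(ids)]}


docs = {1: "hello world hello", 2: "hello there"}
index = {"hello": {1, 2}, "world": {1}, "there": {2}}
print(handle_search("query=hello+world&mode=intersection", docs, index))
# {'results': [{'id': 1, 'text': 'hello world hello'}]}
```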

Constraints

Text: Each document's text will not exceed 10,000 characters and will not contain any punctuation.
Number of Documents: The total number of documents will not exceed 10,000.
Query Length: Search queries will not exceed 500 characters.
