Search Engine I

Get started on the first variant of building a search engine

1 hour
easy
0 submissions

Project Overview

Get started on the first variant of building a search engine

Welcome to the first challenge in the Search Engine Series, a step-by-step guide to building a fully functional and complex search engine. This introductory problem lays the groundwork by implementing a simple search engine based on an inverted index.

You will create a REST API that includes two essential functionalities:

  1. Adding Documents: Store text documents in a structured format to enable efficient searching.
  2. Searching: Query the stored documents to retrieve results based on the words in the query.

Your submissions

No submissions yet, start by making your first submission

Detailed Project Description

Welcome to the first problem in the Search Engine Series, a journey to learning how to build a functional and complex search engine from scratch. You will discover step by step some of the different techniques and algorithms used in making search engines.

In this introductory problem, you are going to task is to implement a simple search engine based on an inverse index. It will be in the form of a Rest API with the following functionalities:

  1. Adding Documents: For simplicity we're going to start by handling just text documents in a way that they can be efficiently searched later.
  2. Searching: Query the stored text documents and return results based on the words in the query.

This problem introduces the concept of an inverted index, which you will use to build your solution.


What is an Inverted Index?

An inverted index is a data structure designed for efficient text searching. It maps each unique word (or token) in a collection of documents to the list of document IDs where the word appears. This allows for rapid retrieval of relevant documents based on search queries. Here's a breakdown of how it works:


1. Document Storage

The inverted index is built by processing each document in the system and recording the occurrence of each word within it. In a real-world scenario, you would typically perform preprocessing on the text (e.g., removing stop words, tokenization, stemming). However, for simplicity in this project, we assume each word in the text is a valid token without additional preprocessing.

Index Construction:

  1. For each token in the document:
    • If the token is not yet in the index:
      • Add it as a key in the index and associate it with a list containing the document ID.
    • If the token is already in the index:
      • Add the document ID to the list if it is not already present.

Example:

Consider two documents:

  • Document 1: "hello world hello"
  • Document 2: "hello there"

The inverted index would look like this:

{
  "hello": [1, 2],
  "world": [1],
  "there": [2]
}

Here:

  • The word "hello" appears in Document 1 and Document 2.
  • The word "world" appears in Document 1.
  • The word "there" appears in Document 2.

2. Search

The inverted index is used to efficiently retrieve documents that match a search query. Here's how:

  • Query Tokenization: The search query is split into individual words or tokens.
  • Lookup: Each token is looked up in the inverted index to retrieve the list of document IDs.
  • Result Compilation: The document IDs from all tokens in the query are combined to produce the final result. Depending on the search logic, results can include:
    • Union of Results: Documents containing any of the query words.
    • Intersection of Results: Documents containing all of the query words.

Example:

Using the inverted index from the previous example:

  • Query: "hello" → Lookup "hello" → Result: [1, 2]
  • Query: "world" → Lookup "world" → Result: [1]
  • Query: "hello world":
    • Union Result: Documents [1, 2] (contain either "hello" or "world")
    • Intersection Result: Document [1] (contains both "hello" and "world")

Summary

An inverted index is a highly effective data structure for fast and accurate text searching. It maps words to documents and enables efficient retrieval of relevant results. This foundational concept will serve as the building block for more advanced search engine features, such as ranking and handling complex queries, in subsequent problems of this series.


API Specification

You need to implement an API with the following routes:

1. Add Document Route

  • Endpoint: POST /documents
  • Request Body:
    { "id": integer, // Unique identifier for the document "text": "string" // The text content of the document }
  • Response:
    { "message": "Document added successfully." }
  • Functionality:
    • Adds a new document to the system.
    • If a document with the same id already exists, update its text.

2. Search Route

  • Endpoint: GET /search

  • Query Parameters:

    • query: A string containing the search query.
    • mode: A string specifying the search mode, either "union" or "intersection".
  • Response:

    { "results": [ { "id": integer, "text": "string" } ] }
  • Functionality:

    • Searches the documents using the words in the query.
    • Returns a list of documents (IDs and their text) containing at least one word from the query.

Constraints

  1. Text: Each document's text will not exceed 10,000 characters and will not contain any punctuation.
  2. Number of Documents: The total number of documents will not exceed 10,000.
  3. Query Length: Search queries will not exceed 500 characters.

Project Completion Criteria

  • The system should allow users to add a document with a unique id and its text content.
  • The system should allow users to search for documents containing any query word using the union mode.
  • The system should allow users to search for documents containing all query words using the intersection mode.
  • The system should return an empty list if no documents match the search query.
  • The system should dynamically update the inverted index when documents are added or updated.
  • The system should conform to the specified request and response structures for the POST /documents and GET /search endpoints.