Chapter 6: Searching & Indexing Big Data
Search Engines
Search engines are among the most widely used implementations of information retrieval systems.
Categories
Web Search
Definition
A system that crawls through the World Wide Web, extracts content from HTML tags, and indexes the information for searchability.
Functionality
Crawls the web to gather data from web pages.
Extracts content from HTML tags to understand the information on web pages.
Indexes the information to make it searchable for users.
Some engines also crawl for images and associated files, processing text for additional information and links before indexing.
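The crawl-and-extract step above can be sketched with Python's standard `html.parser`. This is a minimal illustration only — real crawlers also handle robots.txt, character encodings, duplicate detection, and link scheduling; the sample page and class name here are invented for the example.

```python
from html.parser import HTMLParser

class TextAndLinkExtractor(HTMLParser):
    """Collects visible text and outgoing links from an HTML page,
    skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)  # links to feed back to the crawler

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())  # text to send to the indexer

page = '<html><body><h1>Big Data</h1><p>Indexing at <a href="/scale">scale</a>.</p></body></html>'
parser = TextAndLinkExtractor()
parser.feed(page)
print(parser.text_parts)
print(parser.links)
```

The extracted text goes to the indexing pipeline, while the collected links seed the next round of crawling.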
Example
DuckDuckGo, developed using Solr, is a typical web search engine.
Purpose
To provide users with a convenient way to find relevant information across the vast expanse of the internet.
Vertical Search
Definition:
Addresses a specific domain, focusing on a particular area of information or industry like healthcare or finance.
Scope:
Information is typically sourced from a primary database or document repository, such as collections of Word documents or PDF files.
Purpose:
Offers targeted and specialized search results tailored to the specific needs and interests of users within a particular domain or industry.
Characteristics:
Narrow focus on a specific topic or industry.
Content sourced from primary databases or data sources relevant to the domain.
Provides users with highly relevant and specific search results within the chosen vertical.
Examples: Vertical search engines include those dedicated to healthcare, finance, legal research, and academic literature.
Desktop Search
Functionality:
Designed to index files on a user's PC and make them searchable locally.
Indexing:
Some search engines index both file content and metadata, enhancing search capabilities.
Example:
Spotlight in Mac OS X is a typical desktop search engine.
Features:
Indexes files stored locally on the user's PC, including documents, images, videos, and other file types.
Allows users to search for specific files, folders, or content within files using keywords or phrases.
Provides quick and efficient access to locally stored information without requiring an internet connection.
Benefits:
Enhances productivity by allowing users to quickly locate and access files and information stored on their PC.
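As a rough illustration of what a desktop search engine does under the hood, the toy functions below walk a directory tree, index the words in readable text files, and answer keyword lookups. Real desktop engines (such as Spotlight) also index metadata, watch for file changes, and handle many file formats; the function names here are invented for the sketch.

```python
import os

def build_local_index(root):
    """Walk a directory tree and map each lowercase word to the set of
    file paths containing it -- a toy version of desktop file indexing."""
    index = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as f:
                    words = f.read().lower().split()
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            for word in words:
                index.setdefault(word, set()).add(path)
    return index

def search_local(index, keyword):
    """Return the files whose content contains the keyword."""
    return sorted(index.get(keyword.lower(), set()))
```

Because the index is built ahead of time, lookups are instant and work entirely offline.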
Others
Text, Image, Audio, and Speech:
Information retrieval extends beyond text, encompassing image search, audio fingerprinting, and speech recognition.
Applications:
Image Search: Users can search for images based on keywords, visual similarity, or other metadata.
Audio Fingerprinting: Identifies audio content based on unique characteristics or patterns.
Speech Recognition: Converts spoken language into text, enabling search and analysis of spoken content.
Technologies:
Advanced algorithms and machine learning techniques are employed for image recognition, audio analysis, and speech-to-text conversion.
Benefits:
Facilitates efficient retrieval of relevant information regardless of the format or medium.
Enhances user experience by enabling search across various multimedia formats.
Widespread Adoption: Increasingly used in applications such as digital media management, content discovery, and voice-activated assistants.
Solr and Lucene
Solr is an enterprise-ready, blazingly fast, and highly scalable search platform built using Apache Lucene.
Solr is written in Java and runs as a stand-alone server.
Apache Lucene is a Java-based search library used for indexing and full-text search on large document collections.
Major Building Block/Components
Request Handler
Solr requests are managed by classes implementing SolrRequestHandler.
Each SolrRequestHandler is associated with a specific URI endpoint.
Requests made to a particular endpoint are processed by the corresponding SolrRequestHandler.
Handlers are configured to map to specific URI endpoints for request processing.
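The endpoint-to-handler mapping can be pictured as a simple dispatch table. This is a conceptual sketch only, not Solr's actual implementation — the handler functions and return values are invented to show the pattern of one handler per URI endpoint.

```python
# Conceptual sketch: each URI endpoint maps to exactly one handler,
# analogous to Solr routing /select and /update to registered
# SolrRequestHandler instances.

def select_handler(params):
    """Stand-in for a search handler serving queries."""
    return {"status": 0, "action": "search", "q": params.get("q")}

def update_handler(params):
    """Stand-in for an update handler accepting documents."""
    return {"status": 0, "action": "index", "docs": params.get("docs", [])}

HANDLERS = {
    "/select": select_handler,   # query requests
    "/update": update_handler,   # document additions
}

def dispatch(endpoint, params):
    handler = HANDLERS.get(endpoint)
    if handler is None:
        return {"status": 404}
    return handler(params)

print(dispatch("/select", {"q": "big data"}))
```

Requests to an unregistered endpoint get an error response, mirroring how only configured handlers can serve a given path.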
Search Component
Search components implement features provided by the search handler.
They need to be registered in a SearchHandler, which serves user queries.
Components handle functionalities like query, spell-checking, faceting, and hit-highlighting.
Multiple components can be registered to a search handler for various functionalities.
Query Parser
Query parsers translate user queries into instructions understandable by Lucene.
They are typically registered in the SearchComponent, which defines the search logic.
The SearchComponent then executes the search based on the parsed query.
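The translation a query parser performs can be sketched as turning a raw query string into structured clauses. The toy parser below handles only `field:term` pairs with a default field — real Lucene query parsers also support phrases, boolean operators, boosts, wildcards, and ranges.

```python
def parse_query(q, default_field="text"):
    """Toy query parser: split on whitespace and interpret
    'field:term' tokens, falling back to a default field --
    a simplified stand-in for what Lucene query parsers do."""
    clauses = []
    for token in q.split():
        if ":" in token:
            field, term = token.split(":", 1)
        else:
            field, term = default_field, token
        clauses.append((field, term.lower()))
    return clauses

print(parse_query("title:Solr indexing"))
```

The resulting `(field, term)` clauses are what the search logic can then match against the index.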
Information Retrieval Component & Process
Solr Indexing Process
Text Extraction
Obtain text for indexing from various sources such as files, databases, web pages, or RSS feeds. Extraction can be done by your Java client application or Solr components.
Document preparation
Transform extracted text into a Solr document in the specified format, like XML or JSON.
Post and commit
Post the document to the appropriate Solr endpoint with the required parameters. Solr performs extraction according to the endpoint invoked.
Document pre-processing
Perform clean-up, enrichment, or validation of the text received by Solr.
Field analysis
Convert the input stream into terms using analyzers, tokenizers, and token filters defined in the fieldType. This is part of the analysis chain.
Index
Create an inverted index from the terms output by field analysis. Indexed terms are used for matching and ranking in search requests. After posting, pre-processing, and field analysis in Solr, documents are indexed automatically.
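The field-analysis and index steps above can be sketched end to end: analyze each document's text into terms, then record which documents each term appears in. This is a deliberately tiny stand-in for Solr's tokenizer and token-filter chain; the stop-word list and helper names are invented for the example.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "for"}

def analyze(text):
    """Toy analysis chain: tokenize, lowercase, drop stop words --
    standing in for a fieldType's tokenizer + token filters."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def index_documents(docs):
    """Build an inverted index: term -> sorted list of doc ids."""
    inverted = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            inverted.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in inverted.items()}

docs = {1: "Indexing of Big Data", 2: "Searching big document collections"}
index = index_documents(docs)
print(index["big"])
```

Once built, the index answers "which documents contain this term?" in a single lookup, which is what makes matching and ranking fast at search time.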
Content extraction
It involves extracting content from the data source and converting it into indexable documents.
This process can occur within Solr or in the client application that indexes the documents.
Content extraction is the first step in the information flow.
Text Processing
Text Processing Overview
Raw documents from databases or extracted text from binary documents need processing before indexing.
Processing tasks include cleansing, normalization, enrichment, and aggregation of text.
Tasks are chained together based on processing needs and data quality.
Text Processing in Solr
Two steps for text processing: analysis process and update request processors.
Analysis process: Field-level tokenization and analysis.
Update request processors: Handle text-processing needs for the entire document.
Cleansing and Normalization
Important for removing irrelevant data and transforming necessary data for indexing.
Tasks include punctuation removal, stop word removal, lowercasing, ASCII folding of accented characters, and stemming.
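The cleansing tasks above can be chained as a small function: fold accents to ASCII, lowercase, strip punctuation, drop stop words, and apply a crude suffix-stripping "stemmer". Production systems use proper stemmers (e.g. Porter); the deliberately naive rule here just illustrates the idea.

```python
import re
import unicodedata

STOPWORDS = {"the", "is", "a", "an", "and"}

def normalize(text):
    """Toy normalization chain: accent folding to ASCII, lowercasing,
    punctuation removal, stop-word removal, and a crude suffix-stripping
    'stemmer' (real pipelines use proper stemming algorithms)."""
    # Fold accented characters to their ASCII base (e.g. é -> e)
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    text = text.lower()
    tokens = re.findall(r"[a-z0-9]+", text)          # drops punctuation
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Naive stemming: strip a trailing "ing" from longer words
    return [t[:-3] if t.endswith("ing") and len(t) > 5 else t for t in tokens]

print(normalize("Café indexing is AMAZING!"))
```

Chaining the steps in a fixed order mirrors how Solr's analysis chain applies its token filters one after another.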
Text Enrichment
Involves analyzing and mining content to enhance usability, understanding, and relevance.
Entity extraction identifies entities like persons, organizations, and locations and annotates the content.
Inverted Index
Indexing for Retrieval:
Data from the source needs indexing for quick and accurate retrieval.
Solr, leveraging Lucene internally, creates an inverted index when documents are added.
Inverted Index:
Primary data structure in search engines, including Solr.
Maintains a dictionary of unique terms and maps them to documents where they appear.
Each term is a key, and its value is a postings list, indicating documents where the term occurs.
Analogy to Book Index:
Similar to the index found at the end of a book.
Contains words and corresponding pages where the words are located.
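The dictionary-plus-postings structure described above can be shown directly, along with the classic merge-intersection of two sorted postings lists used to answer a two-term AND query. The index contents are made up for the example.

```python
# A prebuilt toy inverted index: each term maps to its postings list
# (sorted doc ids where the term occurs) -- like a book's back index
# mapping words to page numbers.
inverted_index = {
    "big":    [1, 2, 4],
    "data":   [1, 4],
    "search": [2, 3, 4],
}

def postings(term):
    return inverted_index.get(term, [])

def intersect(p1, p2):
    """Merge-intersect two sorted postings lists: the docs that
    contain both terms, found in a single linear pass."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect(postings("big"), postings("search")))  # docs with both terms
```

Keeping postings sorted is what makes this linear-time merge possible, which is one reason search engines store them that way.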
Retrieval Models
Retrieval Models Overview:
Retrieval models help find relevant documents in search processes.
They use mathematical concepts to define how retrieval works.
Types of Retrieval Models:
Vector Space Model: Represents documents and queries as vectors in a high-dimensional space for comparison.
Probabilistic Model: Calculates the probability of relevance for each document.
Boolean Model: Matches documents based on exact terms.
Language Model: Treats documents and queries as probabilistic language models.
Purpose
They outline how documents are ranked concerning user queries.
Provide a structured approach to understanding document relevance.
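The vector space model listed above can be illustrated with plain term-frequency vectors and cosine similarity. This omits the IDF weighting and length normalization real rankers use; the documents and query are invented for the sketch.

```python
import math
from collections import Counter

def tf_vector(text):
    """Represent text as a term-frequency vector (Counter)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors:
    1.0 for identical direction, 0.0 for no shared terms."""
    terms = set(a) | set(b)
    dot = sum(a[t] * b[t] for t in terms)  # Counter returns 0 for missing terms
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = tf_vector("big data search")
docs = {
    "d1": tf_vector("search engines index big data"),
    "d2": tf_vector("cooking recipes for dinner"),
}
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)
```

Documents sharing more query terms point in a more similar direction in term space, so they score higher and rank first.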
Related Technologies
Apache UIMA:
The Solr UIMA contrib module adds structure to unstructured data.
Enables defining custom pipelines for analyzing unstructured text.
Annotates extracted metadata.
Integrates annotated metadata into Solr fields.
Carrot clustering:
Carrot2 clusters similar or semantically related search results.
Configuration changes in XML enable clustering.
Apache OpenNLP:
Java library for natural language processing.
Supports tasks to understand and interpret human language text.
Apache Tika:
The Apache Tika toolkit parses and extracts content and metadata from many file formats, such as Word, PPT, and PDF.
Solr's Solr Cell framework integrates with Apache Tika to index the content and metadata extracted from these files.
Apache ZooKeeper:
It maintains cluster configurations, node statuses, and synchronization information.
SolrCloud requires at least one active ZooKeeper instance to function properly.
ZooKeeper manages SolrCloud's heartbeat.