Chapter 6: Searching & Indexing Big Data
Search Engines
Search engines are among the most widely used implementations of information retrieval systems.
Categories
Web Search
Definition
A system that crawls through the World Wide Web, extracts content from HTML tags, and indexes the information for searchability.
Functionality
Crawls the web to gather data from web pages.
Extracts content from HTML tags to understand the information on web pages.
Indexes the information to make it searchable for users.
Some engines also crawl for images and associated files, processing text for additional information and links before indexing.
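The crawl-and-extract step above can be sketched with Python's standard `html.parser`. This is a minimal illustration only — real crawlers also handle robots.txt, character encodings, duplicate detection, and link scheduling; the sample page and class name here are invented for the example.

```python
from html.parser import HTMLParser

class TextAndLinkExtractor(HTMLParser):
    """Collects visible text and outgoing links from an HTML page,
    skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)  # links to feed back to the crawler

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())  # text to send to the indexer

page = '<html><body><h1>Big Data</h1><p>Indexing at <a href="/scale">scale</a>.</p></body></html>'
parser = TextAndLinkExtractor()
parser.feed(page)
print(parser.text_parts)
print(parser.links)
```

The extracted text goes to the indexing pipeline, while the collected links seed the next round of crawling.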
Example
DuckDuckGo, developed using Solr, is a typical web search engine.
Purpose
To provide users with a convenient way to find relevant information across the vast expanse of the internet.
Vertical Search
Definition:
Addresses a specific domain, focusing on a particular area of information or industry like healthcare or finance.
Scope:
Information is typically sourced from a primary database or document repository, such as collections of Word documents or PDF files.
Purpose:
Offers targeted and specialized search results tailored to the specific needs and interests of users within a particular domain or industry.
Characteristics:
Narrow focus on a specific topic or industry.
Content sourced from primary databases or data sources relevant to the domain.
Provides users with highly relevant and specific search results within the chosen vertical.
Examples: Vertical search engines include those dedicated to healthcare, finance, legal research, and academic literature.
Desktop Search
Functionality:
Designed to index files on a user's PC and make them searchable locally.
Indexing:
Some search engines index both file content and metadata, enhancing search capabilities.
Example:
Spotlight in Mac OS X is a typical desktop search engine.
Features:
Indexes files stored locally on the user's PC, including documents, images, videos, and other file types.
Allows users to search for specific files, folders, or content within files using keywords or phrases.
Provides quick and efficient access to locally stored information without requiring an internet connection.
Benefits:
Enhances productivity by allowing users to quickly locate and access files and information stored on their PC.
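As a rough illustration of what a desktop search engine does under the hood, the toy functions below walk a directory tree, index the words in readable text files, and answer keyword lookups. Real desktop engines (such as Spotlight) also index metadata, watch for file changes, and handle many file formats; the function names here are invented for the sketch.

```python
import os

def build_local_index(root):
    """Walk a directory tree and map each lowercase word to the set of
    file paths containing it -- a toy version of desktop file indexing."""
    index = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as f:
                    words = f.read().lower().split()
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            for word in words:
                index.setdefault(word, set()).add(path)
    return index

def search_local(index, keyword):
    """Return the files whose content contains the keyword."""
    return sorted(index.get(keyword.lower(), set()))
```

Because the index is built ahead of time, lookups are instant and work entirely offline.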
Others
Text, Image, Audio, and Speech:
Information retrieval extends beyond text, encompassing image search, audio fingerprinting, and speech recognition.
Applications:
Image Search: Users can search for images based on keywords, visual similarity, or other metadata.
Audio Fingerprinting: Identifies audio content based on unique characteristics or patterns.
Speech Recognition: Converts spoken language into text, enabling search and analysis of spoken content.
Technologies:
Advanced algorithms and machine learning techniques are employed for image recognition, audio analysis, and speech-to-text conversion.
Benefits:
Facilitates efficient retrieval of relevant information regardless of the format or medium.
Enhances user experience by enabling search across various multimedia formats.
Widespread Adoption: Increasingly used in applications such as digital media management, content discovery, and voice-activated assistants.
Solr and Lucene
Solr is an enterprise-ready, blazingly fast, and highly scalable search platform built using Apache Lucene.
Solr is written in Java and runs as a stand-alone server.
Apache Lucene is a Java-based search library used for indexing and full-text search on large document collections.
Major Building Block/Components
Request Handler
Solr requests are managed by classes implementing SolrRequestHandler.
Each SolrRequestHandler is associated with a specific URI endpoint.
Requests made to a particular endpoint are processed by the corresponding SolrRequestHandler.
Handlers are configured to map to specific URI endpoints for request processing.
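The endpoint-to-handler mapping can be pictured as a simple dispatch table. This is a conceptual sketch only, not Solr's actual implementation — the handler functions and return values are invented to show the pattern of one handler per URI endpoint.

```python
# Conceptual sketch: each URI endpoint maps to exactly one handler,
# analogous to Solr routing /select and /update to registered
# SolrRequestHandler instances.

def select_handler(params):
    """Stand-in for a search handler serving queries."""
    return {"status": 0, "action": "search", "q": params.get("q")}

def update_handler(params):
    """Stand-in for an update handler accepting documents."""
    return {"status": 0, "action": "index", "docs": params.get("docs", [])}

HANDLERS = {
    "/select": select_handler,   # query requests
    "/update": update_handler,   # document additions
}

def dispatch(endpoint, params):
    handler = HANDLERS.get(endpoint)
    if handler is None:
        return {"status": 404}
    return handler(params)

print(dispatch("/select", {"q": "big data"}))
```

Requests to an unregistered endpoint get an error response, mirroring how only configured handlers can serve a given path.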
Search Component
Search components implement features provided by the search handler.
They need to be registered in a SearchHandler, which serves user queries.
Components handle functionalities like query, spell-checking, faceting, and hit-highlighting.
Multiple components can be registered to a search handler for various functionalities.
Query Parser
Query parsers translate user queries into instructions understandable by Lucene.
They are typically registered in the SearchComponent, which defines the search logic.
The SearchComponent then executes the search based on the parsed query.
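The translation a query parser performs can be sketched as turning a raw query string into structured clauses. The toy parser below handles only `field:term` pairs with a default field — real Lucene query parsers also support phrases, boolean operators, boosts, wildcards, and ranges.

```python
def parse_query(q, default_field="text"):
    """Toy query parser: split on whitespace and interpret
    'field:term' tokens, falling back to a default field --
    a simplified stand-in for what Lucene query parsers do."""
    clauses = []
    for token in q.split():
        if ":" in token:
            field, term = token.split(":", 1)
        else:
            field, term = default_field, token
        clauses.append((field, term.lower()))
    return clauses

print(parse_query("title:Solr indexing"))
```

The resulting `(field, term)` clauses are what the search logic can then match against the index.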
Information Retrieval Component & Process
Solr Indexing Process
Text Extraction
Obtain text for indexing from various sources such as files, databases, web pages, or RSS feeds. Extraction can be done by your Java client application or Solr components.
Document preparation
Transform extracted text into a Solr document in the specified format, like XML or JSON.
Post and commit
Post the document to the appropriate Solr endpoint with the required parameters. Solr performs extraction according to the endpoint invoked.
Document pre-processing
Perform clean-up, enrichment, or validation of the text received by Solr.
Field analysis
Convert the input stream into terms using analyzers, tokenizers, and token filters defined in the fieldType. This is part of the analysis chain.
Index
Create an inverted index from the terms output by field analysis. Indexed terms are used for matching and ranking in search requests. After posting, pre-processing, and field analysis in Solr, documents are indexed automatically.
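The field-analysis and index steps above can be sketched end to end: analyze each document's text into terms, then record which documents each term appears in. This is a deliberately tiny stand-in for Solr's tokenizer and token-filter chain; the stop-word list and helper names are invented for the example.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "for"}

def analyze(text):
    """Toy analysis chain: tokenize, lowercase, drop stop words --
    standing in for a fieldType's tokenizer + token filters."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def index_documents(docs):
    """Build an inverted index: term -> sorted list of doc ids."""
    inverted = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            inverted.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in inverted.items()}

docs = {1: "Indexing of Big Data", 2: "Searching big document collections"}
index = index_documents(docs)
print(index["big"])
```

Once built, the index answers "which documents contain this term?" in a single lookup, which is what makes matching and ranking fast at search time.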
Content extraction
It involves extracting content from the data source and converting it into indexable documents.
This process can occur within Solr or in the client application that indexes the documents.
Content extraction is the first step in the information flow.
Text Processing
Text Processing Overview
Raw documents from databases or extracted text from binary documents need processing before indexing.
Processing tasks include cleansing, normalization, enrichment, and aggregation of text.
Tasks are chained together based on processing needs and data quality.
Text Processing in Solr
Two steps for text processing: analysis process and update request processors.
Analysis process: Field-level tokenization and analysis.
Update request processors: Handle text-processing needs for the entire document.
Cleansing and Normalization
Important for removing irrelevant data and transforming necessary data for indexing.
Tasks include punctuation removal, stop word removal, lowercasing, ASCII folding of accented characters, and stemming.
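The cleansing tasks above can be chained as a small function: fold accents to ASCII, lowercase, strip punctuation, drop stop words, and apply a crude suffix-stripping "stemmer". Production systems use proper stemmers (e.g. Porter); the deliberately naive rule here just illustrates the idea.

```python
import re
import unicodedata

STOPWORDS = {"the", "is", "a", "an", "and"}

def normalize(text):
    """Toy normalization chain: accent folding to ASCII, lowercasing,
    punctuation removal, stop-word removal, and a crude suffix-stripping
    'stemmer' (real pipelines use proper stemming algorithms)."""
    # Fold accented characters to their ASCII base (e.g. é -> e)
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    text = text.lower()
    tokens = re.findall(r"[a-z0-9]+", text)          # drops punctuation
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Naive stemming: strip a trailing "ing" from longer words
    return [t[:-3] if t.endswith("ing") and len(t) > 5 else t for t in tokens]

print(normalize("Café indexing is AMAZING!"))
```

Chaining the steps in a fixed order mirrors how Solr's analysis chain applies its token filters one after another.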
Text Enrichment
Involves analyzing and mining content to enhance usability, understanding, and relevance.
Entity extraction identifies entities like persons, organizations, and locations and annotates the content.
Inverted Index
Indexing for Retrieval:
Data from the source needs indexing for quick and accurate retrieval.
Solr, leveraging Lucene internally, creates an inverted index when documents are added.
Inverted Index:
Primary data structure in search engines, including Solr.
Maintains a dictionary of unique terms and maps them to documents where they appear.
Each term is a key, and its value is a postings list, indicating documents where the term occurs.
Analogy to Book Index:
Similar to the index found at the end of a book.
Contains words and corresponding pages where the words are located.
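The dictionary-plus-postings structure described above can be shown directly, along with the classic merge-intersection of two sorted postings lists used to answer a two-term AND query. The index contents are made up for the example.

```python
# A prebuilt toy inverted index: each term maps to its postings list
# (sorted doc ids where the term occurs) -- like a book's back index
# mapping words to page numbers.
inverted_index = {
    "big":    [1, 2, 4],
    "data":   [1, 4],
    "search": [2, 3, 4],
}

def postings(term):
    return inverted_index.get(term, [])

def intersect(p1, p2):
    """Merge-intersect two sorted postings lists: the docs that
    contain both terms, found in a single linear pass."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect(postings("big"), postings("search")))  # docs with both terms
```

Keeping postings sorted is what makes this linear-time merge possible, which is one reason search engines store them that way.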
Retrieval Models
Retrieval Models Overview:
Retrieval models help find relevant documents in search processes.
They use mathematical concepts to define how retrieval works.
Types of Retrieval Models:
Vector Space Model: Represents documents and queries as vectors in a high-dimensional space for comparison.
Probabilistic Model: Calculates the probability of relevance for each document.
Boolean Model: Matches documents based on exact terms.
Language Model: Treats documents and queries as probabilistic language models.
Purpose
They outline how documents are ranked concerning user queries.
Provide a structured approach to understanding document relevance.
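The vector space model listed above can be illustrated with plain term-frequency vectors and cosine similarity. This omits the IDF weighting and length normalization real rankers use; the documents and query are invented for the sketch.

```python
import math
from collections import Counter

def tf_vector(text):
    """Represent text as a term-frequency vector (Counter)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors:
    1.0 for identical direction, 0.0 for no shared terms."""
    terms = set(a) | set(b)
    dot = sum(a[t] * b[t] for t in terms)  # Counter returns 0 for missing terms
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = tf_vector("big data search")
docs = {
    "d1": tf_vector("search engines index big data"),
    "d2": tf_vector("cooking recipes for dinner"),
}
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)
```

Documents sharing more query terms point in a more similar direction in term space, so they score higher and rank first.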
Related Technologies
Apache UIMA:
The Solr UIMA contrib module adds structure to unstructured data.
Enables defining custom pipelines for analyzing unstructured text.
Annotates extracted metadata.
Integrates annotated metadata into Solr fields.
Carrot clustering:
Carrot2 clusters similar or semantically related search results.
Configuration changes in XML enable clustering.
Apache OpenNLP:
Java library for natural language processing.
Supports tasks to understand and interpret human language text.
Apache Tika:
The Apache Tika toolkit parses and extracts content and metadata from many file formats, such as Word, PPT, and PDF.
Solr's Solr Cell framework integrates with Apache Tika to index the content and metadata extracted from these files.
Apache ZooKeeper:
It maintains cluster configurations, node statuses, and synchronization information.
SolrCloud requires at least one active ZooKeeper instance to function properly.
ZooKeeper manages SolrCloud's heartbeat.