Data Integration: Introduction
Definition of Data Integration
- involves combining data residing in different sources and providing users with a unified view of them
- significant in both commercial and scientific domains
- appears with increasing frequency as the volume of data and the need to share existing data explode
- has become the focus of extensive theoretical work
Steps of Data Integration
- Data profiling: what are these databases about, what do they contain?
- Data quality: is the data complete, free of contradictions, null values?
- Describe the sources: here, the relational model will do fine
- Match and map schemas: find correspondences between table and attribute names, try to associate and unify them
- Move data: execute the established mappings
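The "match and map schemas" and "move data" steps can be sketched in a few lines; all field names here are hypothetical examples, not from any real system:

```python
# Minimal sketch: a schema mapping expressed as a dict, then executed over records.
source_rows = [
    {"cust_name": "Alice", "zip": "12345"},
    {"cust_name": "Bob", "zip": "67890"},
]

# Result of schema matching: source attribute -> target attribute
mapping = {"cust_name": "customer_name", "zip": "postal_code"}

def move(rows, mapping):
    """Execute the established mapping: rewrite each record into the target schema."""
    return [{mapping[k]: v for k, v in row.items() if k in mapping} for row in rows]

target_rows = move(source_rows, mapping)
print(target_rows[0])  # {'customer_name': 'Alice', 'postal_code': '12345'}
```

Real mappings are rarely simple renamings (they may merge, split, or convert values), but the execution step has the same shape: apply correspondences record by record.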
Reasons to integrate data
- In reality, data sets (or databases) are often created independently
- Only discover later that they need to be combined!
- But: different systems, different schemata, different data quality, limited interfaces to the data.
- One goal of data integration: tie together different sources, controlled by many people, under common schema.
- Create a (useful) Web site for tracking services
- Build an (internal) search engine
- Collaborate with third parties, e.g. create branded services
- Comply with government regulations (or the NSA)
- Business intelligence, e.g. what's really wrong with our products?
- Create a recommender for specific services
How to do it?
- CRISP-DM Method: CRoss Industry Standard Process for Data Mining
- What is the application area? What data do I need?
- Specify context, purpose, source(s) and target of a DI task.
- Get an overview of what the data sources needed are about
--> Data profiling
- Evaluate data quality; improve if necessary
- Find a suitable way to describe the sources, sometimes using IR techniques.
- Data (pre-)processing, integration, fusion
- Match and map schemas, in order to provide a uniform interface.
- Migrate data to target schemas
- Application of algorithmic steps
- Presentation/visualization of results
- Interpretation & Evaluation, Extension to other areas/sources
Why is DI hard?
- Systems-level reasons:
- Managing different (autonomous) platforms
- scale of sources: from tens to millions
- SQL across multiple systems is not so simple
- Distributed query processing
- Logical reasons:
- Schema (and data) structure, heterogeneity
- Social reasons:
- Locating and capturing relevant data in the enterprise
- Convincing people to share (data fiefdoms)
- Security, privacy, and performance implications
- Data models and modeling
- Steps of a database design process
- Database queries and query languages, e.g. SQL vs. relational algebra
- Query processing and optimization
- Functional database system layers
- Centralized vs. distributed systems - what is different?
- Data integration: abstract away the fact that data comes from multiple sources in varying schemata
- Problem occurs everywhere: it's key to business, science, Web and government
- Reduce the effort involved in integrating
- Allow for available data to be utilized
- Regardless of the architecture, heterogeneity is a key issue
Data Quality is the methodical approach, policies and processes by which an organization manages the accuracy, validity, timeliness, completeness, uniqueness and consistency of its data in systems and data flows.
Data Quality Dimensions
- refers to the aspect or feature of information that can be assessed and used to determine quality of data.
- Accuracy: data accurately represents "real-world" values
- Validity: Data conforms to the syntax of its definition
- Timeliness: Data represents reality from the required point in time
- Completeness: Data are complete with respect to the required scope of the data
- Uniqueness: Data are properly identified and recorded only once
- Consistency: Data are represented consistently across the data set
Data Quality Rules
- refers to business rules intended to ensure quality of data in terms of accuracy, validity, timeliness, completeness, uniqueness and consistency
- each rule is associated to particular Data Quality Dimension. Multiple rules can be associated to one Dimension
Data Quality Process
- Define DQ Requirements
- Perform data profiling in order to help us discover value frequencies or formats of data
- Data profiling can be performed by using specialized tool or query languages that are supported by data source
- Although some data quality problems can be discovered during data profiling activity, the purpose of data profiling is to give insights for data quality assessment
- Conduct DQ Assessment
- Define data quality rules and quality thresholds
- Perform Data Quality Assessment by enforcing data quality rules on existing data set
- Identify Data Quality Issues and update Issue log
- Resolve DQ Issues
- For data quality issues identified during data quality assessment conduct "root cause analysis" to determine issue root cause
- Conduct issue resolution by eliminating root cause
- Review data policies and procedures if necessary
- Monitor and Control
- Define and populate DQ Scorecards
- Monitor data quality
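The profiling and rule-checking steps of the process above can be sketched as follows; the column values and the format rule are made-up examples:

```python
from collections import Counter
import re

# Hypothetical column of phone-number values to profile
values = ["030-1234", "030-9876", "N/A", "0301234"]

# Data profiling: value frequencies and abstract value patterns (every digit -> 9)
freq = Counter(values)
patterns = Counter(re.sub(r"\d", "9", v) for v in values)
print(patterns.most_common(1))  # the dominant format: [('999-9999', 2)]

# A data quality rule (validity dimension): value must conform to the expected format
rule = re.compile(r"\d{3}-\d{4}")
violations = [v for v in values if not rule.fullmatch(v)]
print(violations)  # ['N/A', '0301234'] -> candidates for the issue log
```

In practice a profiling tool computes such frequencies and patterns for every column; the rule check then turns the insights from profiling into an enforceable assessment.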
DQ Process and SDLC
The data quality process should be part of the SDLC (System Development Life Cycle)
- SDLC refers to the process of planning, creating, testing and deploying an information system
Data Quality Roles
The Data Quality Analyst is the key role responsible for performing the activities associated with the data quality process. DQ Analysts work closely with Business Owners, Data Stewards, Technical Owners and Data Custodians. That includes, but is not limited to: defining data quality rules, analyzing the results of data quality profiling and assessments, and investigating root causes of data quality issues
Data Quality Technology Support:
Key Metadata Tool Requirements:
- Ability to conduct data profiling, including statistical analysis of data sets
- Ability to define and execute data quality rules for critical data elements which are subject of data quality check
- Ability to store data quality profiling and assessment results
- Ability to conduct issue resolution process and discover issue patterns
- Ability to create and visualize data quality scorecards
The web can be represented as a graph
- Web pages as nodes
- Links as directed edges
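This graph view can be represented directly as an adjacency list; the page names below are made up:

```python
# Tiny web graph: pages as nodes, hyperlinks as directed edges
graph = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}

# Count incoming links per page (the raw ingredient of link-based ranking)
in_degree = {page: 0 for page in graph}
for links in graph.values():
    for target in links:
        in_degree[target] += 1
print(in_degree)  # c.html has the most in-links
```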
How to find orientation on the web?
- human-curated web: directories; Yahoo!, DMOZ, LookSmart
- Web search: look for web pages that are relevant to terms occurring in search queries
- in both cases: be prepared for things you cannot trust
Basic Architecture of Search Engines
- Web pages (getting crawled)
- Several data sources
- Index or database
- Runtime system
- User search query
How crawlers work
- A crawler traverses web pages, downloads them for indexing and follows (or harvests) the hyperlinks on the downloaded pages.
- Standard algorithms used
- Breadth First Search (BFS)
- Partial Indegree
- Partial PageRank
- Random Walk
- Focused crawlers use best-first strategy
- Web crawling isn't feasible with one machine
- All of the above steps get distributed
- Even non-malicious pages pose challenges
- Latency/bandwidth to remote servers vary
- Webmasters' stipulations
- How deep should you crawl a site's URL hierarchy?
- There may be site mirrors and duplicate pages
- Malicious pages
- spam pages
- Spider traps - incl. dynamically generated ones
- Politeness - don't hit a server too often
Processing steps in crawling
- Pick a URL from the harvested ones
- Fetch the document at the URL
- Parse the document
- Extract links from it to other docs (URLs)
- Check if the document's content has already been seen (duplicate content)
- For each extracted URL
- Ensure it passes certain URL filter tests
- Check if it has already been harvested
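The steps above fit a breadth-first traversal with a seen-URL set. The sketch below uses hypothetical helper callbacks (`fetch`, `extract_links`, `url_filter`); a real crawler would plug in HTTP fetching, HTML parsing, robots.txt checks, and politeness delays:

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, url_filter, max_pages=100):
    """BFS crawl sketch. fetch/extract_links/url_filter are assumed helpers."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)              # URLs already harvested
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()       # pick a URL from the harvested ones
        doc = fetch(url)               # fetch the document at the URL
        pages[url] = doc
        for link in extract_links(doc):        # parse and extract links
            if url_filter(link) and link not in seen:  # filter + dedupe
                seen.add(link)
                frontier.append(link)
    return pages

# Toy "web" for demonstration: each document is just its own URL
web = {"a": ["b", "c"], "b": ["c"], "c": []}
pages = crawl(["a"], fetch=lambda u: u, extract_links=lambda d: web[d],
              url_filter=lambda u: True)
print(sorted(pages))  # ['a', 'b', 'c']
```

Swapping the deque for a priority queue turns this into a best-first (focused) crawler.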
Building a Search Query
- Query: Syntax and form dependent on search engine
- Typically, string of words or a single word
- Phrases in quotation marks
- Complex queries may require Boolean operators
- AND, OR, NOT, FOLLOWED BY, NEAR
- Natural language queries
- Ask a question
- parses keywords
Central Issues for Search engines
- How does the engine find as many relevant pages as possible? (crawling, indexing)
- How does the engine find me? (-> SEO, SEM)
- How can the search engine know what the user has in mind? (e.g. "Fluke" can mean a fish, the end parts of an anchor, the fins on a whale's tail, or a stroke of luck)
- In which order are search results presented to a user?
- How to make money if search is free?
- General-purpose search engines, e.g. Google, Bing, Baidu etc., which search the web
- Special-purpose search engines which search within a special context or environment only
Ranking search results
- Not all Web pages are equally important!
- There is large diversity in the Web-graph node connectivity, so let's rank pages by their link structure!
Link analysis algorithms
- Link analysis approaches for computing the importance of a node in a graph
- PageRank
- Hubs and Authorities
- Topic-Specific PageRank
- Web Spam Detection Algorithms
- Idea: Links as votes
- Page is more important if it has more links
- In-coming links? Out-going links?
- Think of in-links as votes:
- Are all in-links equal?
- Links from important pages count more
- Recursive question!
Page Rank: Videos
Problems with PageRank
- Measure generic popularity of a page
- Biased against topic-specific authorities
- Solution: Topic-specific PageRank
- Use a single measure of importance
- Solution: Hubs-and-authorities
- Susceptible to link spam
- Artificial link topologies created in order to boost PageRank
- Solution: TrustRank
- Instead of generic popularity, can we measure popularity within a topic?
- Goal: Evaluate Web pages not just according to their popularity, but how close they are to a particular topic, e.g. "sports" or "history"
- Allows search queries to be answered based on interests of the user
Hubs and Authorities
- HITS (Hypertext-Induced Topic Selection)
- Is a measure of importance of pages or documents, similar to PageRank
- Proposed at around same time as PageRank
- Goal: Imagine we want to find good newspapers
- Don't just find newspapers. Find "experts" - people who link in a coordinated way to good newspapers
- Idea: Links as votes
- Page is more important if it has more links
- Interesting pages fall in two classes
- Authorities are pages containing useful information
- Hubs are pages that link to authorities
- Each page has two scores:
- Quality as an expert (hub): Total sum of votes of pages pointed to
- Quality as content (authority): Total sum of votes of experts
- HITS algorithm puts this intuition into a rigorous mathematical framework based on the principle of repeated improvement
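The repeated-improvement loop of HITS can be sketched directly: alternate between updating authority scores from hubs and hub scores from authorities, normalizing each time. The graph below is a made-up example with one "expert" page linking to two newspapers:

```python
def hits(graph, iters=50):
    """HITS sketch: repeated improvement of hub and authority scores."""
    hub = {n: 1.0 for n in graph}
    for _ in range(iters):
        # authority = total sum of votes (hub scores) of pages pointing to it
        auth = {n: 0.0 for n in graph}
        for n, out in graph.items():
            for t in out:
                auth[t] += hub[n]
        norm = sum(v * v for v in auth.values()) ** 0.5
        auth = {n: v / norm for n, v in auth.items()}
        # hub = total sum of authority scores of pages it points to
        hub = {n: sum(auth[t] for t in graph[n]) for n in graph}
        norm = sum(v * v for v in hub.values()) ** 0.5
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

# "e" is an expert linking to newspapers n1, n2; "x" links only to n1
graph = {"e": ["n1", "n2"], "n1": [], "n2": [], "x": ["n1"]}
hub, auth = hits(graph)
print(max(auth, key=auth.get))  # n1: endorsed by the most hub weight
```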
- Spamming: any deliberate action to boost a Web page's position in search engine results, incommensurate with the page's real value
- Spam: Web pages that are the result of spamming
- approx. 10-15 % of Web pages are spam
Early Spammers: Term Spam
- Example: Shirt seller might pretend to be about "movies"; here are two ways how:
- Add the word 'Movie' 1000 times to your page. Set text color to the background color, so only search engines would see it
- Or, run the query "movie" on your target search engine. See what page came first in the listings. Copy it into your page, make it "invisible"
- Google's solution: Believe what people say about you, rather than what you say about yourself
(PageRank as a tool to measure the "importance" of Web pages)
- In the example:
- Shirt seller creates 1000 pages, each links to his with "movie" in the anchor text
- These pages have no links in, so they get a small PageRank (they are not very important, so they won't be ranked high for shirts or movies)
- So the shirt seller can't beat truly important movie pages
- Spammer's goal: Maximize the PageRank of the target page
- Idea: Creating link structure that boost the PageRank of a particular page
- Three kinds of web pages from a spammer's point of view:
- Inaccessible pages
- Accessible pages, e.g. blog comments pages - spammer can post links to his pages
- Own pages, completely controlled by spammer, may span multiple domain names
- Formalisms to define languages
- Use Cases:
- Validate that string conforms to format/pattern
- Special case: Define tokens in programming languages
- Find string of specific pattern in text
- Replace string of specific pattern in text
- Split/elementize strings
- Versatile tool
- in most programming languages
- in reasonable text editors
- In UNIX command line tools, e.g. grep, sed
- In data profiling tools (patterns for phone numbers, e-mail addresses, URIs,...)
- In data integration tools (matching of various formats to standardized representations)
- Regexp matching can be implemented via finite-state automata (mostly efficiently)
- Regexp and finite state automata specify precisely the same languages, namely the regular languages
- There is no regexp to match correctly nested (arbitrarily deeply) parenthesized expressions
- Reason: finite memory!
- Regular languages vs. context-free languages
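The four regexp use cases listed above (validate, find, replace, split) in Python's `re` module; the patterns are simplified illustrations, not production-grade validators:

```python
import re

# Validate: does a string conform to a (simplified) date format?
assert re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2024-05-01")

# Find: extract e-mail-like tokens from text (deliberately naive pattern)
text = "Contact alice@example.com or bob@example.org."
print(re.findall(r"[\w.]+@[\w.]+\w", text))
# ['alice@example.com', 'bob@example.org']

# Replace: normalize runs of whitespace to a single space
print(re.sub(r"\s+", " ", "too   many\tspaces"))  # 'too many spaces'

# Split/elementize a CSV-like string, tolerating stray spaces
print(re.split(r"\s*,\s*", "a, b ,c"))  # ['a', 'b', 'c']
```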
Alphabets, Words and Languages
- Alphabet = finite set of symbols
- latin characters
- ASCII, UTF-8
- Keyboard alphabet
- Word (over alphabet A)
= String (over alphabet A)
= finite sequence of symbols (over alphabet A)
- Language (over alphabet A)
= set of words (over alphabet A)
Operations on words
- If v and w are words, then vw is a word, the concatenation of v and w
- Epsilon is the neutral element w.r.t. concatenation, i.e. ew = we = w for all words w
- If w is a word and k a non-negative integer then the power w^k is defined as follows:
- w^0 = e
- w^k = w^(k-1)w for k > 0
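Over the keyboard alphabet, Python strings model these operations directly, with the empty string playing the role of epsilon:

```python
def power(w, k):
    """w^0 = epsilon, w^k = w^(k-1) concatenated with w."""
    return "" if k == 0 else power(w, k - 1) + w

assert power("ab", 0) == ""          # w^0 = epsilon
assert power("ab", 3) == "ababab"    # w^3
assert "x" + "" == "" + "x" == "x"   # epsilon is neutral w.r.t. concatenation
```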
- Multiple entries for the same real-world entity e.g. customer in CRM
- higher costs if customers get contacted multiple times
- Duplicates falsify statistics on customers, items in stock, etc.
Probabilistic duplicate detection
- Test similarity of records; classify pairs as duplicates based on match probability
Deterministic duplicate detection
- Test equality of normalized version of record
- Normalization loses information
- Very fast when it works!
- Hand-coded rules for an "acceptable match"
- e.g. same SSN, or same zipcode, birthdate
- again fast, when working; difficult to tune, expensive to test
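A deterministic approach with a hand-coded rule might look like this; the normalization and the rule are illustrative examples, and the records are fabricated:

```python
def normalize(record):
    """Hypothetical normalization: lowercase, keep only alphanumeric characters.
    Note the information loss: 'J. Smith' and 'JSmith' become indistinguishable."""
    return {k: "".join(ch for ch in str(v).lower() if ch.isalnum())
            for k, v in record.items()}

def acceptable_match(a, b):
    """Hand-coded rule (illustrative): same SSN, or same zipcode and birthdate."""
    return a["ssn"] == b["ssn"] or (a["zip"] == b["zip"] and a["dob"] == b["dob"])

r1 = {"name": "J. Smith ", "ssn": "123-45", "zip": "10115", "dob": "1990-01-01"}
r2 = {"name": "j smith",   "ssn": "123-45", "zip": "10115", "dob": "1990-01-01"}
print(normalize(r1)["name"] == normalize(r2)["name"])  # True: equality after normalization
print(acceptable_match(r1, r2))                        # True: rule fires
```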
How to define similarity functions?
- Many functions proposed (edit distance, cosine similarity ...)
- Domain knowledge is critical
- Cosine (set)
- Cosine (vector)
- Given: Collections X and Y of records
- Near duplicates = pairs of records x ∈ X and y ∈ Y with high similarity (measured quantitatively by a similarity function and threshold t)
- The similarity join problem is to find all pairs of records <x,y> such that sim(x,y)>= t
- Applying sim(x,y) to all pairs of candidates x ∈ X and y ∈ Y is impractical (quadratic in the size of the data)
- Solution: apply sim(x,y) to only the most promising pairs, using a method FindCands
- For each string x ∈ X:
  - use method FindCands to find a candidate set Z ⊆ Y
  - for each string y ∈ Z:
    - if sim(x,y) >= t, return (x,y) as a matched pair
- The set Z is called the umbrella set of x in the following
- We now discuss ways to implement FindCands
- Using Jaccard and overlap measures for now
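A minimal sketch of this scheme with Jaccard similarity, where FindCands is implemented naively via an inverted index on tokens (candidate = shares at least one token with x); real implementations prune much more aggressively, e.g. with prefix filtering:

```python
def jaccard(x, y):
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

def find_cands(x, index):
    """Umbrella set of x: ids of all y sharing at least one token with x."""
    z = set()
    for token in x:
        z |= index.get(token, set())
    return z

def similarity_join(X, Y, t):
    # inverted index: token -> ids of records in Y containing it
    index = {}
    for j, y in enumerate(Y):
        for token in y:
            index.setdefault(token, set()).add(j)
    matches = []
    for i, x in enumerate(X):
        for j in find_cands(x, index):       # only the promising pairs...
            if jaccard(x, Y[j]) >= t:        # ...are verified exactly
                matches.append((i, j))
    return matches

X = [["data", "integration"], ["web", "search"]]
Y = [["data", "integration", "quality"], ["page", "rank"]]
print(similarity_join(X, Y, t=0.5))  # [(0, 0)]
```

Pairs with no token overlap are never verified, which is exactly the quadratic work the FindCands step avoids.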
How to detect similarities efficiently?
- Running time should scale with the data set
How to detect similarities effectively?
- Method should be of high quality
Terminology and Goals
Goal: Given a duplicate, create a single object representation while resolving conflicting data values.
- Contradictions in data values
- Uncertainty and truth: Discover the true value and model uncertainty in this process
- Metadata: Preferences, recency, correctness
- Lineage: Keep original values and their origin
- Implementation in DBMS: SQL, extended SQL, UDFs, etc.
- Identical tuples: agree on all attributes
- Subsumed tuple: contains more null values and is otherwise equal
- Complementing tuples: both have non-null values in distinct places that fit together
- Conflicting tuples: at least one non-null attribute where they differ
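The subsumed/complementing/conflicting relations above can be checked mechanically; the sketch below assumes tuples share the same attributes and uses None for null, with made-up example data:

```python
def subsumes(t1, t2):
    """t1 subsumes t2: t2 has more nulls and agrees with t1 everywhere else."""
    return (all(t2[k] is None or t2[k] == t1[k] for k in t1)
            and sum(v is None for v in t2.values())
              > sum(v is None for v in t1.values()))

def complement(t1, t2):
    """Fuse complementing tuples: non-null values in distinct places that fit
    together. Returns None if the tuples conflict on a non-null attribute."""
    if any(t1[k] is not None and t2[k] is not None and t1[k] != t2[k] for k in t1):
        return None  # conflicting tuples: a fusion policy must resolve the value
    return {k: t1[k] if t1[k] is not None else t2[k] for k in t1}

a = {"name": "Alice", "city": None, "zip": "10115"}
b = {"name": "Alice", "city": "Berlin", "zip": None}
print(complement(a, b))  # {'name': 'Alice', 'city': 'Berlin', 'zip': '10115'}
```

For conflicting tuples, `complement` refuses to fuse; resolving such conflicts is where metadata (preferences, recency, correctness) comes into play.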