Data Integration: Introduction
Definition of Data Integration
- involves combining data residing in different sources and providing users with a unified view of them
- significant in both commercial and scientific domains
- appears with increasing frequency as the volume of data and the need to share existing data explode
- has become the focus of extensive theoretical work
Steps of Data Integration
- Data profiling: what are these databases about, what do they contain?
- Data quality: is the data complete, free of contradictions, null values?
- Describe the sources: here, the relational model will do fine
- Match and map schemas: find correspondences between table and attribute names, try to associate and unify them
- Move data: execute the established mappings
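The "match and map schemas" and "move data" steps can be sketched in a few lines; all field names here are hypothetical examples, not from any real system:

```python
# Minimal sketch: a schema mapping expressed as a dict, then executed over records.
source_rows = [
    {"cust_name": "Alice", "zip": "12345"},
    {"cust_name": "Bob", "zip": "67890"},
]

# Result of schema matching: source attribute -> target attribute
mapping = {"cust_name": "customer_name", "zip": "postal_code"}

def move(rows, mapping):
    """Execute the established mapping: rewrite each record into the target schema."""
    return [{mapping[k]: v for k, v in row.items() if k in mapping} for row in rows]

target_rows = move(source_rows, mapping)
print(target_rows[0])  # {'customer_name': 'Alice', 'postal_code': '12345'}
```

Real mappings are rarely simple renamings (they may merge, split, or convert values), but the execution step has the same shape: apply correspondences record by record.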
Reasons to integrate data
- In reality, data sets (or databases) are often created independently
- Only discover later that they need to be combined!
- But: different systems, different schemata, different data quality, limited interfaces to the data.
- One goal of data integration: tie together different sources, controlled by many people, under common schema.
- Create a (useful) Web site for tracking services
- Build an (internal) search engine
- Collaborate with third parties, e.g. create branded services
- Comply with government regulations (or the NSA)
- Business intelligence, e.g. what's really wrong with our products?
- Create a recommender for specific services
How to do it?
- CRISP-DM Method: CRoss Industry Standard Process for Data Mining
- What is the application area? What data do I need?
- Specify context, purpose, source(s) and target of a DI task.
- Get an overview of what the data sources needed are about
--> Data profiling
- Evaluate data quality; improve if necessary
- Find a suitable way to describe the sources, sometimes using IR techniques.
- Data (pre-)processing, integration, fusion
- Match and map schemas, in order to provide a uniform interface.
- Migrate data to target schemas
- Application of algorithmic steps
- Presentation/visualization of results
- Interpretation & Evaluation, Extension to other areas/sources
Why is DI hard?
- Systems-level reasons:
- Managing different (autonomous) platforms
- scale of sources: from tens to millions
- SQL across multiple systems is not so simple
- Distributed query processing
- Logical reasons:
- Schema (and data) structure, heterogeneity
- Social reasons:
- Locating and capturing relevant data in the enterprise
- Convincing people to share (data fiefdoms)
- Security, privacy, and performance implications
- Data models and modeling
- Steps of a database design process
- Database queries and query languages, e.g. SQL vs. relational algebra
- Query processing and optimization
- Functional database system layers
- Centralized vs. distributed systems - what is different?
- Data integration: abstract away the fact that data comes from multiple sources in varying schemata
- Problem occurs everywhere: it's key to business, science, Web and government
- Reduce the effort involved in integrating
- Allow for available data to be utilized
- Regardless of the architecture, heterogeneity is a key issue
Data Quality is the methodical approach, policies and processes by which an organization manages the accuracy, validity, timeliness, completeness, uniqueness and consistency of its data in systems and data flows.
Data Quality Dimensions
- refers to the aspect or feature of information that can be assessed and used to determine quality of data.
- Accuracy: data accurately represents "real-world" values
- Validity: Data conforms to the syntax of its definition
- Timeliness: Data represents reality from the required point in time
- Completeness: Data are complete with respect to the required scope of the data
- Uniqueness: Data are properly identified and recorded only once
- Consistency: Data are represented consistently across the data set
Data Quality Rules
- refers to business rules intended to ensure quality of data in terms of accuracy, validity, timeliness, completeness, uniqueness and consistency
- each rule is associated to particular Data Quality Dimension. Multiple rules can be associated to one Dimension
Data Quality Process
- Define DQ Requirements
- Perform data profiling in order to help us discover value frequencies or formats of data
- Data profiling can be performed by using specialized tool or query languages that are supported by data source
- Although some data quality problems can be discovered during data profiling activity, the purpose of data profiling is to give insights for data quality assessment
- Conduct DQ Assessment
- Define data quality rules and quality thresholds
- Perform Data Quality Assessment by enforcing data quality rules on existing data set
- Identify Data Quality Issues and update Issue log
- Resolve DQ Issues
- For data quality issues identified during data quality assessment conduct "root cause analysis" to determine issue root cause
- Conduct issue resolution by eliminating root cause
- Review data policies and procedures if necessary
- Monitor and Control
- Define and populate DQ Scorecards
- Monitor data quality
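The profiling and rule-checking steps of the process above can be sketched as follows; the column values and the format rule are made-up examples:

```python
from collections import Counter
import re

# Hypothetical column of phone-number values to profile
values = ["030-1234", "030-9876", "N/A", "0301234"]

# Data profiling: value frequencies and abstract value patterns (every digit -> 9)
freq = Counter(values)
patterns = Counter(re.sub(r"\d", "9", v) for v in values)
print(patterns.most_common(1))  # the dominant format: [('999-9999', 2)]

# A data quality rule (validity dimension): value must conform to the expected format
rule = re.compile(r"\d{3}-\d{4}")
violations = [v for v in values if not rule.fullmatch(v)]
print(violations)  # ['N/A', '0301234'] -> candidates for the issue log
```

In practice a profiling tool computes such frequencies and patterns for every column; the rule check then turns the insights from profiling into an enforceable assessment.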
DQ Process and SDLC
The data quality process should be part of the SDLC (System Development Life Cycle)
- SDLC refers to the process of planning, creating, testing and deploying an information system
Data Quality Roles
The Data Quality Analyst is the key role responsible for performing the activities associated with the data quality process. DQ Analysts work closely with Business Owners, Data Stewards, Technical Owners and Data Custodians. That includes, but is not limited to: defining data quality rules, analyzing the results of data quality profiling and assessments, and investigating root causes of data quality issues
Data Quality Technology Support:
Key Metadata Tool Requirements:
- Ability to conduct data profiling, including statistical analysis of data sets
- Ability to define and execute data quality rules for critical data elements which are subject of data quality check
- Ability to store data quality profiling and assessment results
- Ability to conduct issue resolution process and discover issue patterns
- Ability to create and visualize data quality scorecards
The web can be represented as a graph
- Web pages as nodes
- Links as directed edges
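This graph view can be represented directly as an adjacency list; the page names below are made up:

```python
# Tiny web graph: pages as nodes, hyperlinks as directed edges
graph = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}

# Count incoming links per page (the raw ingredient of link-based ranking)
in_degree = {page: 0 for page in graph}
for links in graph.values():
    for target in links:
        in_degree[target] += 1
print(in_degree)  # c.html has the most in-links
```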
How to find orientation on the web?
- human-curated web: directories; Yahoo!, DMOZ, LookSmart
- Web search: look for web pages that are relevant to terms occurring in search queries
- in both cases: be prepared for things you cannot trust
Basic Architecture of Search Engines
- Web pages (getting crawled)
- Several data sources
- Index or database
- Runtime system
- User search query
How crawlers work
- A crawler traverses web pages, downloads them for indexing and follows (or harvests) the hyperlinks on the downloaded pages.
- Standard algorithms used
- Breadth First Search (BFS)
- Partial Indegree
- Partial PageRank
- Random Walk
- Focused crawlers use best-first strategy
- Web crawling isn't feasible with one machine
- All of the above steps get distributed
- Even non-malicious pages pose challenges
- Latency/bandwidth to remote servers vary
- Webmasters' stipulations
- How deep should you crawl a site's URL hierarchy?
- There may be site mirrors and duplicate pages
- Malicious pages
- spam pages
- Spider traps - incl. dynamically generated ones
- Politeness - don't hit a server too often
Processing steps in crawling
- Pick a URL from the harvested ones
- Fetch the document at the URL
- Parse the document
- Extract links from it to other docs (URLs)
- Check if the document's content has already been seen (duplicate content)
- For each extracted URL
- Ensure it passes certain URL filter tests
- Check if it has already been harvested
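The steps above fit a breadth-first traversal with a seen-URL set. The sketch below uses hypothetical helper callbacks (`fetch`, `extract_links`, `url_filter`); a real crawler would plug in HTTP fetching, HTML parsing, robots.txt checks, and politeness delays:

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, url_filter, max_pages=100):
    """BFS crawl sketch. fetch/extract_links/url_filter are assumed helpers."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)              # URLs already harvested
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()       # pick a URL from the harvested ones
        doc = fetch(url)               # fetch the document at the URL
        pages[url] = doc
        for link in extract_links(doc):        # parse and extract links
            if url_filter(link) and link not in seen:  # filter + dedupe
                seen.add(link)
                frontier.append(link)
    return pages

# Toy "web" for demonstration: each document is just its own URL
web = {"a": ["b", "c"], "b": ["c"], "c": []}
pages = crawl(["a"], fetch=lambda u: u, extract_links=lambda d: web[d],
              url_filter=lambda u: True)
print(sorted(pages))  # ['a', 'b', 'c']
```

Swapping the deque for a priority queue turns this into a best-first (focused) crawler.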
Building a Search Query
- Query: Syntax and form dependent on search engine
- Typically, string of words or a single word
- Phrases in quotation marks
- Complex queries may require Boolean operators
- AND, OR, NOT, FOLLOWED BY, NEAR
- Natural language queries
- Ask a question
- parses keywords
Central Issues for Search engines
- How does the engine find as many relevant pages as possible? (crawling, indexing)
- How does the engine find me? (-> SEO, SEM)
- How can the search engine know what the user has in mind? (e.g. "Fluke" can mean a fish, the end parts of an anchor, the fins on a whale's tail, or a stroke of luck)
- In which order are search results presented to a user?
- How to make money if search is free?
- General-purpose search engines, e.g. Google, Bing, Baidu etc., which search the web
- Special-purpose search engines which search within a special context or environment only
Ranking search results
- Not all Web pages are equally important!
- There is large diversity in the Web-graph node connectivity, so let's rank pages by their link structure!
Link analysis algorithms
- Link analysis approaches for computing the importance of a node in a graph
- PageRank
- Hubs and Authorities
- Topic-Specific PageRank
- Web Spam Detection Algorithms
- Idea: Links as votes
- Page is more important if it has more links
- In-coming links? Out-going links?
- Think of in-links as votes:
- Are all in-links equal?
- Links from important pages count more
- Recursive question!
Page Rank: Videos
Problems with PageRank
- Measure generic popularity of a page
- Biased against topic-specific authorities
- Solution: Topic-specific PageRank
- Use a single measure of importance
- Solution: Hubs-and-authorities
- Susceptible to link spam
- Artificial link topologies created in order to boost PageRank
- Solution: TrustRank
- Instead of generic popularity, can we measure popularity within a topic?
- Goal: Evaluate Web pages not just according to their popularity, but how close they are to a particular topic, e.g. "sports" or "history"
- Allows search queries to be answered based on interests of the user
Hubs and Authorities
- HITS (Hypertext-Induced Topic Selection)
- Is a measure of importance of pages or documents, similar to PageRank
- Proposed at around same time as PageRank
- Goal: Imagine we want to find good newspapers
- Don't just find newspapers. Find "experts" - people who link in a coordinated way to good newspapers
- Idea: Links as votes
- Page is more important if it has more links
- Interesting pages fall in two classes
- Authorities are pages containing useful information
- Hubs are pages that link to authorities
- Each page has two scores:
- Quality as an expert (hub): Total sum of votes of pages pointed to
- Quality as content (authority): Total sum of votes of experts
- HITS algorithm puts this intuition into a rigorous mathematical framework based on the principle of repeated improvement
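The repeated-improvement loop of HITS can be sketched directly: alternate between updating authority scores from hubs and hub scores from authorities, normalizing each time. The graph below is a made-up example with one "expert" page linking to two newspapers:

```python
def hits(graph, iters=50):
    """HITS sketch: repeated improvement of hub and authority scores."""
    hub = {n: 1.0 for n in graph}
    for _ in range(iters):
        # authority = total sum of votes (hub scores) of pages pointing to it
        auth = {n: 0.0 for n in graph}
        for n, out in graph.items():
            for t in out:
                auth[t] += hub[n]
        norm = sum(v * v for v in auth.values()) ** 0.5
        auth = {n: v / norm for n, v in auth.items()}
        # hub = total sum of authority scores of pages it points to
        hub = {n: sum(auth[t] for t in graph[n]) for n in graph}
        norm = sum(v * v for v in hub.values()) ** 0.5
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

# "e" is an expert linking to newspapers n1, n2; "x" links only to n1
graph = {"e": ["n1", "n2"], "n1": [], "n2": [], "x": ["n1"]}
hub, auth = hits(graph)
print(max(auth, key=auth.get))  # n1: endorsed by the most hub weight
```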
- Spamming: any deliberate action to boost a Web page's position in search engine results, incommensurate with the page's real value
- Spam: Web pages that are the result of spamming
- approx. 10-15 % of Web pages are spam
Early Spammers: Term Spam
- Example: Shirt seller might pretend to be about "movies"; here are two ways how:
- Add the word 'Movie' 1000 times to your page. Set text color to the background color, so only search engines would see it
- Or, run the query "movie" on your target search engine. See what page came first in the listings. Copy it into your page, make it "invisible"
- Google's solution: Believe what people say about you, rather than what you say about yourself
(PageRank as a tool to measure the "importance" of Web pages)
- In the example:
- Shirt seller creates 1000 pages, each links to his with "movie" in the anchor text
- These pages have no links in, so they get a small PageRank (they are not very important, so they won't be ranked high for shirts or movies)
- So the shirt seller can't beat truly important movie pages
- Spammer's goal: Maximize the PageRank of the target page
- Idea: Creating link structure that boost the PageRank of a particular page
- Three kinds of web pages from a spammer's point of view:
- Inaccessible pages
- Accessible pages, e.g. blog comments pages - spammer can post links to his pages
- Own pages, completely controlled by spammer, may span multiple domain names
- Formalisms to define languages
- Use Cases:
- Validate that string conforms to format/pattern
- Special case: Define tokens in programming languages
- Find string of specific pattern in text
- Replace string of specific pattern in text
- Split/elementize strings
- Versatile tool
- in most programming languages
- in reasonable text editors
- In UNIX command line tools, e.g. grep, sed
- In data profiling tools (patterns for phone numbers, e-mail addresses, URIs,...)
- In data integration tools (matching of various formats to standardized representations)
- Regexp matching can be implemented via finite-state automata (mostly efficiently)
- Regexp and finite state automata specify precisely the same languages, namely the regular languages
- There is no regexp to match correctly nested (arbitrarily deeply) parenthesized expressions
- Reason: finite memory!
- Regular languages vs. context-free languages
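The four regexp use cases listed above (validate, find, replace, split) in Python's `re` module; the patterns are simplified illustrations, not production-grade validators:

```python
import re

# Validate: does a string conform to a (simplified) date format?
assert re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2024-05-01")

# Find: extract e-mail-like tokens from text (deliberately naive pattern)
text = "Contact alice@example.com or bob@example.org."
print(re.findall(r"[\w.]+@[\w.]+\w", text))
# ['alice@example.com', 'bob@example.org']

# Replace: normalize runs of whitespace to a single space
print(re.sub(r"\s+", " ", "too   many\tspaces"))  # 'too many spaces'

# Split/elementize a CSV-like string, tolerating stray spaces
print(re.split(r"\s*,\s*", "a, b ,c"))  # ['a', 'b', 'c']
```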
Alphabets, Words and Languages
- Alphabet = finite set of symbols
- latin characters
- ASCII, UTF-8
- Keyboard alphabet
- Word (over alphabet A)
= String (over alphabet A)
= finite sequence of symbols (over alphabet A)
- Language (over alphabet A)
= set of words (over alphabet A)
Operations on words
- If v and w are words, then vw is a word, the concatenation of v and w
- Epsilon is the neutral element w.r.t. concatenation, i.e. ew = we = w for all words w
- If w is a word and k a non-negative integer then the power w^k is defined as follows:
- w^0 = e
- w^k = w^(k-1)w for k > 0
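Over the keyboard alphabet, Python strings model these operations directly, with the empty string playing the role of epsilon:

```python
def power(w, k):
    """w^0 = epsilon, w^k = w^(k-1) concatenated with w."""
    return "" if k == 0 else power(w, k - 1) + w

assert power("ab", 0) == ""          # w^0 = epsilon
assert power("ab", 3) == "ababab"    # w^3
assert "x" + "" == "" + "x" == "x"   # epsilon is neutral w.r.t. concatenation
```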
- Multiple entries for the same real-world entity e.g. customer in CRM
- higher costs if customers get contacted multiple times
- Duplicates falsify statistics on customers, items in stock, etc.
Probabilistic duplicate detection
- Test similarity of records; classify pairs as duplicates based on match probability
Deterministic duplicate detection
- Test equality of normalized version of record
- Normalization loses information
- Very fast when it works!
- Hand-coded rules for an "acceptable match"
- e.g. same SSN, or same zipcode, birthdate
- again fast, when working; difficult to tune, expensive to test
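A deterministic approach with a hand-coded rule might look like this; the normalization and the rule are illustrative examples, and the records are fabricated:

```python
def normalize(record):
    """Hypothetical normalization: lowercase, keep only alphanumeric characters.
    Note the information loss: 'J. Smith' and 'JSmith' become indistinguishable."""
    return {k: "".join(ch for ch in str(v).lower() if ch.isalnum())
            for k, v in record.items()}

def acceptable_match(a, b):
    """Hand-coded rule (illustrative): same SSN, or same zipcode and birthdate."""
    return a["ssn"] == b["ssn"] or (a["zip"] == b["zip"] and a["dob"] == b["dob"])

r1 = {"name": "J. Smith ", "ssn": "123-45", "zip": "10115", "dob": "1990-01-01"}
r2 = {"name": "j smith",   "ssn": "123-45", "zip": "10115", "dob": "1990-01-01"}
print(normalize(r1)["name"] == normalize(r2)["name"])  # True: equality after normalization
print(acceptable_match(r1, r2))                        # True: rule fires
```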
How to define similarity functions?
- Many functions proposed (edit distance, cosine similarity ...)
- Domain knowledge is critical
- Cosine (set)
- Cosine (vector)
- Given: Collections X and Y of records
- Near duplicates = pairs of records x ∈ X and y ∈ Y with high similarity (measured quantitatively by a similarity function and threshold t)
- The similarity join problem is to find all pairs of records <x,y> such that sim(x,y)>= t
- Applying sim(x,y) to all pairs of candidates x ∈ X and y ∈ Y is impractical (quadratic in the size of the data)
- Solution: apply sim(x,y) to only the most promising pairs, using a method FindCands
- For each string x ∈ X:
  - use method FindCands to find a candidate set Z ⊆ Y
  - for each string y ∈ Z:
    - if sim(x,y) >= t, return (x,y) as a matched pair
- The set Z is called the umbrella set of x in the following
- We now discuss ways to implement FindCands
- Using Jaccard and overlap measures for now
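A minimal sketch of this scheme with Jaccard similarity, where FindCands is implemented naively via an inverted index on tokens (candidate = shares at least one token with x); real implementations prune much more aggressively, e.g. with prefix filtering:

```python
def jaccard(x, y):
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

def find_cands(x, index):
    """Umbrella set of x: ids of all y sharing at least one token with x."""
    z = set()
    for token in x:
        z |= index.get(token, set())
    return z

def similarity_join(X, Y, t):
    # inverted index: token -> ids of records in Y containing it
    index = {}
    for j, y in enumerate(Y):
        for token in y:
            index.setdefault(token, set()).add(j)
    matches = []
    for i, x in enumerate(X):
        for j in find_cands(x, index):       # only the promising pairs...
            if jaccard(x, Y[j]) >= t:        # ...are verified exactly
                matches.append((i, j))
    return matches

X = [["data", "integration"], ["web", "search"]]
Y = [["data", "integration", "quality"], ["page", "rank"]]
print(similarity_join(X, Y, t=0.5))  # [(0, 0)]
```

Pairs with no token overlap are never verified, which is exactly the quadratic work the FindCands step avoids.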
How to detect similarities efficiently?
- Running time should scale with the data set
How to detect similarities effectively?
- Method should be of high quality
Terminology and Goals
Goal: Given a duplicate, create a single object representation while resolving conflicting data values.
- Contradictions in data values
- Uncertainty and truth: Discover the true value and model uncertainty in this process
- Metadata: Preferences, recency, correctness
- Lineage: Keep original values and their origin
- Implementation in DBMS: SQL, extended SQL, UDFs, etc.
- Identical tuples: agree on all attributes
- Subsumed tuple: contains more null values and is otherwise equal
- Complementing tuples: both have non-null values in distinct places that fit together
- Conflicting tuples: at least one non-null attribute where they differ
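The subsumed/complementing/conflicting relations above can be checked mechanically; the sketch below assumes tuples share the same attributes and uses None for null, with made-up example data:

```python
def subsumes(t1, t2):
    """t1 subsumes t2: t2 has more nulls and agrees with t1 everywhere else."""
    return (all(t2[k] is None or t2[k] == t1[k] for k in t1)
            and sum(v is None for v in t2.values())
              > sum(v is None for v in t1.values()))

def complement(t1, t2):
    """Fuse complementing tuples: non-null values in distinct places that fit
    together. Returns None if the tuples conflict on a non-null attribute."""
    if any(t1[k] is not None and t2[k] is not None and t1[k] != t2[k] for k in t1):
        return None  # conflicting tuples: a fusion policy must resolve the value
    return {k: t1[k] if t1[k] is not None else t2[k] for k in t1}

a = {"name": "Alice", "city": None, "zip": "10115"}
b = {"name": "Alice", "city": "Berlin", "zip": None}
print(complement(a, b))  # {'name': 'Alice', 'city': 'Berlin', 'zip': '10115'}
```

For conflicting tuples, `complement` refuses to fuse; resolving such conflicts is where metadata (preferences, recency, correctness) comes into play.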