Schedule
Slides
Videos
Opinionated…
Framework
Profession and Education
Roles for Computing in Social Change (Harvard, Cornell) Positing that computational research has valuable roles to play in addressing social problems
- Computing as diagnostic
Computing can help us measure social problems and diagnose how they manifest in technical systems.
- Computing as formalizer
Computing shapes how social problems are explicitly defined — changing how those problems, and possible responses to them, are understood.
- Computing as rebuttal
Computing can clarify the limits of technical interventions, and of policies premised on them.
- Computing as synecdoche
Computing can foreground long-standing social problems in a new way.
Fairness
Counterfactual
:fountain_pen: FlipTest: Fairness Testing via Optimal Transport (CMU)
- Naively flipping the protected feature in the data for assessment doesn't work (e.g., it produces out-of-distribution samples).
- Counterfactual fairness relies on a causal model - but then we need a causal model!
- Idea: optimal transport - map one group's distribution (e.g., men) onto the other's (e.g., women) by matching each data point to a counterpart.
- Query the model on the matched pairs; the pairs whose predictions differ form the flipset - ideally there is no difference between the two.
- They use GANs to approximate the mapping.
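A minimal sketch of the evaluation step, assuming the optimal-transport mapping (which the paper approximates with a GAN) has already been computed and the classifier exposes an sklearn-style `.predict`:

```python
import numpy as np

def flipset(model, X_group_a, mapped_to_b):
    """Sketch of a FlipTest-style check: compare predictions on each point
    from one group and its optimal-transport counterpart in the other group.

    X_group_a   : (n, d) data points from one group
    mapped_to_b : (n, d) their mapped counterparts (assumed precomputed)
    Returns the indices whose prediction flips, i.e. the flipset.
    """
    preds_a = model.predict(X_group_a)
    preds_b = model.predict(mapped_to_b)
    flips = np.where(preds_a != preds_b)[0]
    print(f"{len(flips)} of {len(preds_a)} matched pairs change prediction")
    return flips
```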
:fountain_pen: Counterfactual risk assessments, evaluation, and fairness (CMU)
- Problem: Most algorithmic risk assessments are trained and evaluated on historical data in which the observed outcomes depend on the historical decision-making policy.
- Approach: Counterfactual risk modeling and evaluation that properly account for these intervention effects without building a structural causal model (SCM).
- Technique for computing the metrics: doubly-robust (DR) counterfactual evaluation (a generic sketch follows the case study below).
- Regarding fairness:
- Counterfactual formulations of three standard fairness metrics that are more appropriate for decision-making settings.
- Theoretical results showing that only under strong conditions, which are unlikely to hold in general, does fairness according to standard metrics imply fairness according to counterfactual metrics.
- Empirically, applying existing fairness-corrective methods can increase disparity in the counterfactual redefinition of the metric they target.
Case study: Child welfare screening
- The outcome is re-referral within a six-month period.
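For reference, the generic doubly-robust estimator of a counterfactual outcome mean that such evaluation builds on (standard textbook form, not the authors' code; `mu_hat` and `pi_hat` are assumed to be pre-fit outcome and propensity models):

```python
import numpy as np

def doubly_robust_mean(y, a, x, mu_hat, pi_hat, target_a=0):
    """Doubly-robust estimate of E[Y(target_a)]: the mean outcome had everyone
    received decision `target_a`, corrected for the historical decision policy.

    y, a, x : observed outcomes, historical decisions, covariates
    mu_hat  : fitted outcome model, mu_hat(x) ~ E[Y | X=x, A=target_a]
    pi_hat  : fitted propensity model, pi_hat(x) ~ P(A=target_a | X=x)
    """
    mu = mu_hat(x)                                # outcome-model prediction
    pi = np.clip(pi_hat(x), 1e-3, 1.0)            # clip to avoid huge weights
    correction = (a == target_a) / pi * (y - mu)  # inverse-propensity residual
    return float(np.mean(mu + correction))
```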
Economics
:fountain_pen: Fair classification and social welfare (Harvard)
- How do leading notions of fairness as defined by computer scientists map onto longer-standing notions of social welfare?
- Welfare is defined by the number of positively labeled individuals achievable for each social group.
- The Pareto principle requires that when all individuals strictly prefer alternative x to alternative y, so does the social planner.
- ERM with proxy fairness constraint: ε-fair Soft SVM.
- We prove that stricter fairness standards do not necessarily support welfare-enhancing outcomes for the disadvantaged group.
- In many such cases, the learning goal of ensuring group-based fairness is incompatible with the Pareto Principle.
- Asking that an algorithmic procedure abide by a more stringent fairness criterion can lead to classification schemes that actually make every stakeholder group worse off!
- Efficient algorithm to find the welfare value for each ε (a toy ε-fair soft SVM is sketched after this list).
- Approach: determine when a change in ε causes a change in the classification outcome of individuals.
- Experiment on the Adult dataset.
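A minimal sketch of what an ε-fair soft SVM could look like as constrained ERM (the mean-score-gap constraint below is a common proxy and an assumption on my part, not necessarily the paper's exact formulation); sweeping ε traces out the welfare trade-off described above:

```python
import cvxpy as cp
import numpy as np

def eps_fair_soft_svm(X, y, group, eps, C=1.0):
    """Soft-margin SVM with a proxy fairness constraint: the two groups' mean
    signed distances to the decision boundary may differ by at most eps.

    X: (n, d) features; y: (n,) labels in {-1, +1}; group: (n,) in {0, 1}.
    """
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    xi = cp.Variable(n, nonneg=True)                 # hinge-loss slack variables
    scores = X @ w + b
    idx0, idx1 = np.where(group == 0)[0], np.where(group == 1)[0]
    gap = cp.sum(scores[idx0]) / len(idx0) - cp.sum(scores[idx1]) / len(idx1)
    problem = cp.Problem(
        cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
        [cp.multiply(y, scores) >= 1 - xi,           # soft-margin constraints
         cp.abs(gap) <= eps])                        # epsilon-fairness proxy
    problem.solve()
    return w.value, b.value
```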
:fountain_pen: Fairness and utilization in allocating resources with uncertain demand (Cornell)
Task: Resource allocation when the demands for the resource are distributed across multiple groups and drawn from (known) probability distributions.
Natural fairness requirement: Individuals from different groups should have (approximately) equal probabilities of receiving the resource.
Metrics:
- Utilization = the expected number of resources that get used
- Probability gap = the maximal difference (across individuals) in the probability of obtaining the resource, given that the individual needs it
Utilization ratio (\(UR(\alpha)\)) is the ratio of the utilizations obtained: \(\frac{\text{Max utilization}}{\text{Max utilization with an }\alpha\text{-gap}}\). Also called the Price of Fairness. (A Monte-Carlo sketch of both metrics follows the results below.)
Core results
In the worst case, utilization and probability gap objectives can be completely opposed. But for many commonly-seen distributions, allocations that optimize for one objective also do very well on the other.
- Discrete allocation → For \(\alpha < 1\), \(UR(\alpha)\) is unbounded.
- Continuous/probabilistic allocation → For \(\alpha > 0\), \(UR(\alpha)\) upper bounded by \(\frac{1}{\alpha}\), but \(UR(\alpha)\) unbounded for \(\alpha = 0\).
- A specific family (including exponential, Weibull) → Max utilization and probability gap \(\alpha = 0\) can be achieved simultaneously: \(UR(\alpha) = 1\).
- Power law distributions → Goals are closely aligned: Given a fixed number of groups, \(UR(\alpha)\) is bounded by small constant independent of distribution parameters.
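A Monte-Carlo sketch of the two metrics for a fixed per-group allocation (the paper treats these analytically; the function, the uniform within-group assignment, and the Poisson example are illustrative assumptions):

```python
import numpy as np

def allocation_metrics(allocation, demand_samplers, n_trials=10000, seed=0):
    """Estimate expected utilization and probability gap for a fixed allocation,
    assuming resources within a group go uniformly at random to those in need.

    allocation      : resources reserved per group, e.g. [4, 2]
    demand_samplers : one callable per group returning an integer demand draw
    """
    rng = np.random.default_rng(seed)
    n_groups = len(allocation)
    used_total = 0.0
    prob_sum = np.zeros(n_groups)        # running sum of P(served | in need)
    trials_with_need = np.zeros(n_groups)
    for _ in range(n_trials):
        for g in range(n_groups):
            demand = demand_samplers[g](rng)
            served = min(allocation[g], demand)
            used_total += served
            if demand > 0:
                prob_sum[g] += served / demand
                trials_with_need[g] += 1
    utilization = used_total / n_trials
    p_served = prob_sum / np.maximum(trials_with_need, 1)
    return utilization, p_served.max() - p_served.min()

# Illustrative example with Poisson demands of different means:
# util, gap = allocation_metrics([4, 2],
#     [lambda r: r.poisson(3.0), lambda r: r.poisson(1.5)])
```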
:fountain_pen: The effects of competition and regulation on error inequality in data-driven markets (UPenn, MSR)
Economic incentives that can drive unfairness
- Consider a setting in which firms use data to provide a product/service to consumers
- Consumers all benefit from increased accuracy - think speech recognition, search (NOT loans, insurance, etc.)
- Unfairness formalized as gap in error rates across groups
Takeaways
- Economic incentives may affect fairness outcomes in real-world data-driven markets
- It may not be enough to identify technical sources of unfairness and build algorithmic fairness tools; incentives may need to change!
Dynamics (Impact)
:fountain_pen: Fairness Is Not Static: Deeper Understanding of Long Term Fairness via Simulation Studies (Google)
- Using an MDP framework to run simulations of the impact of deployed models (without a reward); a toy agent/environment loop is sketched at the end of this paper's notes
- Agent (Algorithm)
- Environment (Society)
- Scenario I: Binary classifier (lending)
- Agents
- Results
- Diverging narratives from one-step (analytical) analysis
- EO (equality of opportunity) agents and aggregate TPR
- Scenario II: Attention allocation
- Agents
- Uniform
- Proportional
- Greedy
- Exploration
- Metric: the maximal gap in empirical discovery probabilities between all pairs of sites
- Results
- Effectiveness - with dynamics, maximizing hits is not the same as minimizing misses
- Fairness - agents behave differently
- Scenario III: Strategic manipulation (college admission)
- Individuals are able to pay a cost to manipulate their features (and increase their score)
- Applicants are aware of the agent decision rule
- Metric: Social burden
- Agent
- Static (one-shot)
- Robust
- Continuous
- Results
- Continuous retraining compensates for strategic manipulation
- Stronger with noise in the score-label relationship
- Trade-off: agent utility vs. individual utility
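A toy agent/environment loop in the spirit of such simulations (in the spirit of Google's ml-fairness-gym but not its actual API; the lending dynamics and the threshold policy are made-up illustrations):

```python
import numpy as np

class LendingEnv:
    """Toy environment: two groups with different repayment probabilities that
    drift depending on past lending decisions (hypothetical dynamics)."""
    def __init__(self, repay_prob=(0.8, 0.6), drift=0.01, seed=0):
        self.repay_prob = list(repay_prob)
        self.drift = drift
        self.rng = np.random.default_rng(seed)

    def step(self, lend, group):
        repaid = lend and self.rng.random() < self.repay_prob[group]
        if lend:  # successful loans slowly improve the group's repayment rate
            self.repay_prob[group] = min(1.0, self.repay_prob[group]
                                         + self.drift * (1 if repaid else -1))
        return repaid

def run_episode(env, policy, steps=1000):
    """Fraction of encounters per group that end in a repaid loan
    (a crude group-outcome proxy for comparing agents over time)."""
    outcomes = {0: [], 1: []}
    for _ in range(steps):
        group = int(env.rng.integers(2))
        lend = policy(group, env.repay_prob)   # agent's decision rule
        outcomes[group].append(env.step(lend, group))
    return {g: float(np.mean(o)) if o else 0.0 for g, o in outcomes.items()}

# Example agent: lend whenever the group's repayment probability exceeds 0.7.
# run_episode(LendingEnv(), lambda g, p: p[g] > 0.7)
```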
:fountain_pen: The disparate equilibria of algorithmic decision making when individuals invest rationally (UC Berkeley, MSR, Cornell)
Analyzing feedback loops in classification (e.g., hiring, university admission)
Model
- Individual’s Rational Response
- Invest a cost to acquire qualifications (\(Y=1\)).
- It depends on the qualification assessment rule currently implemented by the institution.
- In any group, the cost is distributed randomly.
- Get payoff if assessed to be qualified.
- Institution’s Rational Response
- Choose a qualification assessment parameter for accepting individuals to maximize its utility
- Gain (TP) and cost (FP)
- Maximize expected utility (can be expressed as the rate of qualification in each group)
- Have infinitely many samples from the underlying distributions
Dynamics (a generic best-response sketch follows at the end of these notes)
What are the equilibria? Metrics of interest:
- Stability
- Rate of qualification in each group
- The balance between rates of qualification
- Institutional utility
Result:
- If there exists a zero-error hiring policy in the model class, there is a unique (non-trivial) equilibrium
- All groups have the same qualification rate at equilibrium. This is also the optimal qualification rate.
- This also holds approximately if there exists a low-error hiring policy.
Challenge: Heterogeneity across groups
- There exists a zero-error hiring policy for each group separately but not together.
- Then two types of equilibria exist:
- Only one group reaches the optimal qualification rate (unbalanced) - stable
- Both groups have the same qualification rate - unstable
- Almost never converge to a "balanced" long term outcome, even if you started close to one!
Interventions
- Decoupling
- Always helps in the group-realizable setting: not only does it not decrease any group’s equilibrium qualification rate
- It also increases the equilibrium qualification rate of at least one group when realizability across all groups does not hold
- When group-realizability does not hold, we see that in some cases decoupling is still helpful while in others it can significantly harm one group
- Subsidizing the cost of investment in a disadvantaged group
- Stable unbalanced
- Unstable more balanced
- In the non-realizable setting, it also improves the quality of the equilibria. However, the new equilibrium is not guaranteed to be locally stable
Case-study: FICO, etc.
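A generic sketch of the feedback loop as fixed-point iteration between the institution's and the individuals' best responses (an abstraction of mine; `accept_prob` and `invest_response` stand in for the paper's model):

```python
import numpy as np

def best_response_dynamics(accept_prob, invest_response, pi0, n_rounds=200):
    """Iterate institution/individual best responses until a fixed point.

    accept_prob(pi)      : institution's best-response acceptance policy given
                           the current per-group qualification rates pi
    invest_response(acc) : fraction of each group for which investing in
                           qualifications is worthwhile under that policy
    pi0                  : initial qualification rate per group
    """
    pi = np.asarray(pi0, dtype=float)
    for _ in range(n_rounds):
        acc = accept_prob(pi)           # institution best-responds to pi
        pi_next = invest_response(acc)  # individuals best-respond to the policy
        if np.allclose(pi_next, pi, atol=1e-6):
            break                       # reached a fixed point (an equilibrium)
        pi = np.asarray(pi_next, dtype=float)
    return pi
```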
Privacy
:fountain_pen: Fair decision making using privacy-protected data (Duke, UMass)
- Setting: Sensitive personal data is used to decide who will receive resources or benefits
- RQ: Impact of differential privacy on fair and equitable decision-making
- Real-world scenarios
Takeaways:
- If decisions are made using an ϵ-differentially private version of the data, under strict privacy constraints (smaller ϵ), the noise added to achieve privacy may disproportionately impact some groups over others
- Designers of privacy algorithms must evaluate the fairness of outcomes, in addition to conventional aggregate error metrics that have historically been their focus
- Optimizing for aggregate error on published statistics does not reliably lead to more accurate or fair outcomes for a downstream decision problem (DAWA vs. Laplace).
Scenario I: Minority-language voting rights
- Task: Binary decision rule based on district statistics (which are released ϵ-DP)
- Fairness metric: equality across jurisdictions of P(covered | data) (the randomness comes from the DP algorithm)
- Findings:
- There are significant disparities in the rate of correct classification across jurisdictions
- Significant differences in the rate of successful classification across jurisdictions is a consequence of the decision rule and its interaction with the noise added for privacy
- A jurisdiction's distance from the nearest threshold explains classification rates under D-Laplace but not under DAWA (which exacerbates disparities); see the sketch below
Mitigating unfairness: estimate the posterior probability that the jurisdiction is Covered given the observed noisy counts, and set a threshold (trade-off between FP and FN).
Scenario II: Title I funds allocation
At least $675 billion relies on data released by the Census Bureau.
Scenario III: Apportionment of legislative representatives
Data- and Workload-Aware (DAWA): introduces more complex, data-dependent noise that is adapted to the input data.
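A small illustration of why distance to the decision threshold drives classification rates under DP noise (plain Laplace mechanism; the counts and threshold are made up):

```python
import numpy as np

def dp_coverage_decision(true_count, threshold, epsilon, sensitivity=1.0,
                         n_trials=10000, seed=0):
    """Probability that a threshold decision on an epsilon-DP noisy count
    matches the decision on the true count (Laplace mechanism)."""
    rng = np.random.default_rng(seed)
    noisy = true_count + rng.laplace(scale=sensitivity / epsilon, size=n_trials)
    true_decision = true_count >= threshold
    return float(np.mean((noisy >= threshold) == true_decision))

# Jurisdictions far from the threshold are classified correctly almost always;
# those near it become coin flips under strict privacy (small epsilon):
# dp_coverage_decision(true_count=10050, threshold=10000, epsilon=0.1)
# dp_coverage_decision(true_count=10005, threshold=10000, epsilon=0.1)
```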
Domain adaptation
:fountain_pen: Fairness Warnings and Fair-MAML: Learning Fairly with Minimal Data (UCI, Haverford)
- Fairness Warnings
- Model-agnostic algorithm that provides interpretable boundary conditions for when a fairly trained model may not behave fairly on similar but slightly different tasks within a given domain.
- Train an interpretable model to predict which mean shifts of the data cause a classifier to become unfair.
- Fair-MAML
- A fair meta-learning approach to train models that can be quickly fine-tuned to specific tasks from only a small number of sample instances while balancing fairness and accuracy (a toy fairness-regularized task loss is sketched below)
- K-shot fairness, i.e. training a fair model on a new task with only K data points.
- There is more in the paper
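A sketch of the kind of fairness-regularized task loss a Fair-MAML-style inner adaptation step could use (the demographic-parity penalty and the gamma trade-off are assumptions of this sketch; the paper also considers other regularizers):

```python
import torch

def fair_task_loss(model, x, y, group, gamma=1.0):
    """Cross-entropy plus a demographic-parity-style penalty, usable both in
    the inner K-shot adaptation step and in the meta-update.

    x: features; y: class labels (long tensor); group: tensor in {0, 1}
    (assumes both groups are present in the batch); gamma: fairness weight.
    """
    logits = model(x)                                   # (n, 2) class logits
    ce = torch.nn.functional.cross_entropy(logits, y)
    p_pos = torch.softmax(logits, dim=1)[:, 1]          # P(positive outcome)
    parity_gap = (p_pos[group == 0].mean() - p_pos[group == 1].mean()).abs()
    return ce + gamma * parity_gap
```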
Ranking
:fountain_pen: Interventions for ranking in the presence of implicit bias (Yale, IIT Kanpur)
Subset selection task (shortlisting candidates for an interview)
- Rooney Rule: Select at least one candidate from an underprivileged group for an interview
- The rule increases the total utility of the selection [Kleinberg & Raghavan 18]
Ranking task
- Individuals placed later in the ranking are less likely to receive positive outcomes.
Model:
- (Unknown) Latent utility
- (Known) Position-based discount (e.g., DCG, Zipfian)
- (Unknown) Implicit bias factor
- (Known) Observed utility
Output ranking: the ranking maximizing the weighted sum of observed utilities (the number of ranked items can be less than the whole list).
Goal: Design constraints on the output ranking such that it has high latent utility.
L-constraint: a lower bound on the number of items from a particular group in the top-k positions of the ranking, for every position k in the output.
Theoretical results
- The class of L-constraints defined above is expressive enough to recover the optimal latent utility while optimizing observed utility, when constrained to a certain specific L that depends on the latent utility vector, for all implicit bias parameters.
- Rooney rule generalization: given \(\alpha \in [0,1]\), the constraint \(L(\alpha)\) is defined as follows: for all \(k \in [n]\), \(L_{ka}=0\) and \(L_{kb}=\alpha k\) (a greedy sketch follows the case studies below).
Under natural distributional assumptions on the utilities of items, surprisingly, these constraints can recover almost all of the utility lost due to implicit biases
These constraints appear to be robust to deviations from our assumptions.
Case studies
- IIT-JEE (2009) dataset
- Semantic Scholar Research corpus
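A greedy sketch of enforcing the generalized Rooney-rule constraint \(L_{kb}=\alpha k\) while ranking by observed utility (the greedy construction and the 'a'/'b' group labels are my simplifications, not the paper's algorithm):

```python
import math

def rank_with_rooney_constraint(items, alpha):
    """Build a ranking so that by position k at least ceil(alpha*k) items come
    from the underrepresented group 'b', otherwise pick by observed utility.

    items: list of (observed_utility, group) tuples with group in {'a', 'b'}.
    """
    remaining = sorted(items, key=lambda t: -t[0])   # best observed utility first
    ranking, n_b = [], 0
    for k in range(1, len(items) + 1):
        need_b = math.ceil(alpha * k) > n_b          # constraint binds at position k
        pool = [it for it in remaining if it[1] == 'b'] if need_b else remaining
        pick = pool[0] if pool else remaining[0]     # fall back if group b exhausted
        remaining.remove(pick)
        ranking.append(pick)
        n_b += pick[1] == 'b'
    return ranking

# Example: with alpha = 0.3, a slot is reserved for the best remaining group-b
# candidate whenever the running constraint would otherwise be violated.
# rank_with_rooney_constraint([(0.9, 'a'), (0.8, 'a'), (0.7, 'b'), (0.6, 'b')], 0.3)
```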
Case-study
YouTube Radicalization
Auditing radicalization pathways on YouTube (EPFL, UFMG)
- Large-scale audit of user radicalization on YouTube
349 channels, 330k videos, 72M+ comments
- Four groups: Media, the Alt-lite, the Intellectual Dark Web (I.D.W.), and the Alt-right.
Results:
- The four groups share the same user base;
- Users consistently migrate from milder to more extreme content
- Probing the recommendation system: Alt-lite content is easily reachable from I.D.W. channels, while Alt-right videos are reachable only through channel recommendations.
- Take with a grain of salt: does it really affect people's opinions and actions? Is the pipeline actually caused by the recommendation system?
Health
"The human body is a black box": supporting clinical decision-making with deep learning (Duke, Data & Society)
- Sepsis Watch, a machine learning-driven tool that assists hospital clinicians in the early diagnosis and treatment of sepsis.
- Sepsis is an inflammatory response to infection that can lead to organ failure and is the leading cause of inpatient deaths in US hospitals. It is not only hard to predict but also lacks a universally accepted definition.
- Considered the development of Sepsis Watch as the development of a sociotechnical system, not an isolated model.
- The model generates risk scores every hour for every adult patient to detect sepsis.
- How to build trust without ground truth? Not through interpretability, but through a lot of work with all the stakeholders.
Criminal Justice System
The impact of overbooking on a pre-trial risk assessment tool (Human Rights Data Analysis Group)
- With San Francisco's newly elected District Attorney
- Data from a pilot run of a pre-trial risk assessment tool
- Overbooking = booking charges that do not result in a conviction
- Booking charges that do not result in a conviction (i.e. charges that are dropped or end in an acquittal) increased the recommended level of pre-trial supervision in around 27% of cases evaluated by the tool.
- The ultimate objective of this analysis is to assess how often “unfair" booking charges caused the PSA to recommend excessively restrictive conditions of pre-trial supervision
- Dynamic pattern of cascading disadvantage
- Perform counterfactual analysis (booking charges vs. conviction charges)
- Disaggregating the analysis by race shows that while Black individuals received unwarranted charge-based exclusions and NVCA flags at a higher rate than non-Black individuals, they did not receive increased recommendations at a substantially higher rate due to the fact that Black individuals were more likely to be classified in the higher risk groups even before charge-based increases are applied.
Data in New Delhi's predictive policing system (Article 19, Jawaharlal Nehru University)
- In-situ ethnographic study of New Delhi Police’s data collection practices (30M people)
- Crime Mapping, Analytics and Predictive System (CMAPS)
- Offering methodological considerations for studying AI deployments in non-western contexts.
- Analysis
- Bias
- Call takers resort to standardized questions about the location of the caller and do not enquire further because they are incentivized to be quick more than they are incentivized to be accurate.
- A high volume of calls might not be an indicator of high crime but a lack of access to other sections of governance for these urban poor.
- Green Diary draft with a ‘pending’ status.
- Disparate impact, or indirect discrimination
- Direct discrimination
- Hard coding arbitrariness
- Only the more severe crime is taken into consideration, which undermines the accuracy of this data as an indicator of the frequency of crimes across the spectrum
- Opacity as a feature, not a bug
Targeted social polices
Algorithmic Targeting of Social Policies: Fairness, Accuracy, and Distributed Governance (MIT)
- Targeted social policies are the main strategy for poverty alleviation across the developing world.
- Due to their scale, diversity, and widespread relevance, these are among the most important algorithms operating in the world today.
- Improved an eligibility system using ML, with impact on ~1M people in two Latin American countries.
- Substantially increased accuracy.
- Absent explicit parity constraints, both status quo and AI-based systems induce disparities across population subgroups.
- Worked on tackling the lack of consensus on normative standards for prioritization and fairness criteria with a decision-support platform for distributed governance
Hiring
Mitigating bias in algorithmic hiring: evaluating claims and practices (Cornell)
- Documenting and analyzing the claims and practices of companies offering algorithms for employment assessment.
- Technically, we consider the various choices vendors make regarding data collection and prediction targets, and explore the risks and trade-offs that these choices pose. We also discuss how algorithmic de-biasing techniques interface with, and create challenges for, antidiscrimination law.
Explainability
The Hidden Assumptions Behind Counterfactual Explanations and Principal Reasons (MSR, UCLA, Cornell)
Highlighting subsets of features in the service of autonomy
- Counterfactual explanations
- Motivated by the GDPR
- The goal is to provide actionable guidance - to explain how things could have been different and provide a concrete set of steps a consumer might take to achieve a different outcome in the future.
- Counterfactual explanations are generated by identifying the features that, if minimally changed, would alter the output of the model (a toy search is sketched at the end of these notes).
- Principal reason
- Motivated by credit regulation in the US
- "Adverse action notices" (ANN)
- What counts as a principal reason is not well-defined in either the statutes or regulation.
- Principal reasons are satisfied by a broader array of possible feature-highlighting explanations.
Hidden assumptions behind the belief that these explanations will be useful for decision subjects
- Features do not clearly map to actions
- Features cannot be made commensurate by looking only at the distribution of the training data (i.e. what is the distance function?)
- Features may be relevant to decision making in multiple domains
- Models may not have certain properties: stability, monotonicity, and binary outcomes
Unavoidable Tensions
- The autonomy paradox
- The burden and power to choose
- Too much transparency
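Referring back to how counterfactual explanations are generated: a toy single-feature search for illustration (hypothetical helper; real methods optimize over an explicit distance function, which is exactly what the hidden assumptions above concern):

```python
import numpy as np

def greedy_counterfactual(model, x, step_sizes, max_multiple=20):
    """Try progressively larger changes to one feature at a time until the
    model's binary decision flips; step_sizes stands in for the distance
    function that real counterfactual methods must choose.

    model : callable returning 0/1 for a 1-D feature vector
    x     : the instance that received the unfavorable decision
    """
    x = np.asarray(x, dtype=float)
    steps = np.asarray(step_sizes, dtype=float)
    original = model(x)
    for m in range(1, max_multiple + 1):          # growing magnitude of change
        for i in range(len(x)):
            for sign in (+1, -1):
                candidate = x.copy()
                candidate[i] += sign * m * steps[i]
                if model(candidate) != original:
                    return i, candidate           # feature to change + new values
    return None                                   # no single-feature flip found
```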
Data
Lessons from archives: strategies for collecting sociocultural data in machine learning (Stanford, Google)
- FATE issues are rooted in decisions surrounding the data collection and annotation process.
- New specialization: data collection and annotation
Archives are the longest standing communal effort to gather human information and archive scholars have already developed the language and procedures to address and discuss many challenges pertaining to data collection.
- "Wild west" Web crawling vs. Community archives vs. Curatorial archives
Lessons from Archives
- Inclusivity: mission statements & collection policies
- Consent: community & participatory archives
- Power: data consortia
- Transparency: appraisal records & committee-based data collection
- Ethics & privacy: codes of ethics and conduct
Case study: GPT-2 and Reddit
Garbage in, garbage out?: do machine learning application papers in social computing report where human-labeled training data comes from? (UC Berkeley)
- Does "gold standard" data is reliable in the first place?
Labeling - similar to structured content analysis
- Longstanding methodology in the social sciences and humanities, with many established best practices
Examined ML papers that used Twitter data, asking e.g.:
- Were such best practices followed?
- Who were the labelers?
- How did they label?
- Inter-rater reliability metrics
- Compensation for crowdworkers
Findings:
- They indicate concern, given how crucial the quality of training data is and how difficult it is to standardize human judgment.
- Yet they also give us hope, as we found a number of papers we considered to be excellent cases of reporting the processes behind their datasets.
Towards fairer datasets: filtering and balancing the distribution of the people subtree in the ImageNet hierarchy (Princeton)
Focus on the person subtree in ImageNet
ImageNet data collection pipeline:
- Concept vocabulary (WordNet)
- Candidate images (Search engine)
- Manual cleanup (AMT)
Identified three key factors that may lead to problematic behavior in downstream technology:
- The stagnant concept vocabulary from WordNet (unsafe)
- The attempt at exhaustive illustration of all categories with images (imageability)
- The inequality of demographic representation in the images (gender, skin color, age)
Blog post: