IB Computer Science Case Study
Machine learning
Supervised learning
Algorithm
Training set: provide the algorithm with categorised/labelled data
Testing set: feed the machine new, unlabelled data to see whether it tags the new data appropriately
Based on how well it does on the testing set, a new training set is developed and the process repeated
Suited to classification and regression problems
Unsupervised learning
Data input without labels
Data items compared to each other and then sorted into categories
Suited to pattern/structure recognition
Reinforcement Learning
Similar to supervised learning but model receives rewards in real time
Algorithm given a goal as well as a range of actions to achieve that goal
Each action is given a score based on how close it comes to achieving the goal
Pushes algorithm to try different approaches until maximum score achieved
Sort of like trial and error
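The goal/actions/score loop above can be sketched as an epsilon-greedy bandit in Python; the actions and their hidden reward probabilities are invented for illustration, and real reinforcement learning adds states and delayed rewards:

```python
import random

random.seed(1)

# Hypothetical one-state task: three actions with hidden average rewards.
true_reward = {"left": 0.2, "middle": 0.5, "right": 0.8}
value = {a: 0.0 for a in true_reward}   # learned score per action
counts = {a: 0 for a in true_reward}

for step in range(2000):
    # Trial and error: mostly exploit the best-known action, sometimes explore.
    if random.random() < 0.1:
        action = random.choice(list(true_reward))
    else:
        action = max(value, key=value.get)
    reward = random.random() < true_reward[action]   # stochastic reward signal
    counts[action] += 1
    value[action] += (reward - value[action]) / counts[action]  # running mean

print(max(value, key=value.get))  # the action with the highest learned score
```

Over many trials the running-mean scores converge towards the true rewards, so the algorithm settles on the action that best achieves the goal.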
Recommender Systems
Content-based filtering
Advantages
Works with less data
Provides results based on activities of specific user
Disadvantages
Over-specialization: keeps recommending items similar to those the user already likes
Emphasizes content features
Uses a content profile that includes content features
Oriented around product features, so it has no problems with new products
Can provide more accurate recommendations, as it focuses on features of the content a user likes
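The content-profile idea can be sketched with cosine similarity between item feature vectors and a user profile; the films, features and weights below are all hypothetical:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical content features: [action, comedy, romance]
items = {"Film A": [0.9, 0.1, 0.0],
         "Film B": [0.1, 0.8, 0.4],
         "Film C": [0.8, 0.2, 0.1]}
user_profile = [1.0, 0.2, 0.0]   # built from features of content the user liked

ranked = sorted(items, key=lambda t: cosine(items[t], user_profile), reverse=True)
print(ranked)  # items whose features best match the user's profile come first
```

Because only the item's own features are compared against the profile, a brand-new film with a feature vector can be recommended immediately, which is the new-product advantage noted above.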
Collaborative filtering
Advantages
Other users' scores are used
No deterministic result, since chance is involved in the system
Disadvantages
Needs more data
Problems with new products and users
popularity bias
Emphasizes user preferences
Requires the user profile to suggest relevant content
Feeds on user ratings, reviews, thumbs up and down, and other feedback
Products without feedback or reviews can't be recommended
New users without profile can't be given recommendations
Doesn't always ensure precise recommendations
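Using other users' scores can be sketched as user-based collaborative filtering on a toy ratings table; the users, films and similarity measure (inverse mean absolute difference, an assumption here) are illustrative, and real systems typically use Pearson correlation or cosine similarity:

```python
# Hypothetical user ratings (1-5); missing entries are unrated items.
ratings = {
    "alice": {"Film A": 5, "Film B": 1, "Film C": 4},
    "bob":   {"Film A": 4, "Film B": 1, "Film D": 5},
    "carol": {"Film B": 5, "Film C": 1, "Film D": 2},
}

def similarity(u, v):
    """Agreement on co-rated items: 1 / (1 + mean absolute rating difference)."""
    common = ratings[u].keys() & ratings[v].keys()
    if not common:
        return 0.0
    diff = sum(abs(ratings[u][i] - ratings[v][i]) for i in common) / len(common)
    return 1 / (1 + diff)

def recommend(user):
    """Suggest items the most similar other user rated that `user` hasn't seen."""
    peer = max((v for v in ratings if v != user), key=lambda v: similarity(user, v))
    unseen = ratings[peer].keys() - ratings[user].keys()
    return sorted(unseen, key=lambda i: ratings[peer][i], reverse=True)

print(recommend("alice"))  # the most similar user's top unseen items
```

The sketch also shows the cold-start problems listed above: a new user has no co-rated items (similarity 0 with everyone), and a new product with no ratings never appears in any peer's list.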
k-nearest neighbor
Stores all available cases and classifies new cases based on a similarity measure
Classifies a data point based on how its neighbours are classified
'k' refers to the number of nearest neighbours to take into consideration
Classified based on the majority classification of the k nearest neighbours
'k' chosen using parameter tuning
important for better accuracy
choosing k is often a case of trial and error
Smaller 'k' values mean each neighbour has a much higher influence on the result
noise/atypical data points can cause problems
Larger 'k' values make matching cases more computationally expensive, as more distance calculations have to be done
can also result in over-smoothing: not finding the best match for a new case
Matrix Factorisation
Matrix of users against items with their ratings for the ones they have watched
Matrix factorisation used to predict what their rating would be for ones they haven't watched and recommend them accordingly
User-item matrix split into an item-feature matrix and a user-feature matrix using an iterative algorithm, a process called decomposition
values in the item-feature and user-feature matrices are adjusted using a cost function until they reproduce the known values in the initial user-item matrix as accurately as possible
dot product of the user-feature and item-feature matrices calculates the values for a new user-item matrix with all values filled in
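The decomposition loop above can be sketched with NumPy, using plain gradient descent on a toy user-item matrix; the learning rate, regularisation, feature count and iteration count are illustrative choices, not prescribed values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy user-item ratings; 0 marks an item the user hasn't rated.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0                              # cost function only uses known ratings

n_features = 2
U = rng.random((R.shape[0], n_features)) * 0.1   # user-feature matrix
V = rng.random((R.shape[1], n_features)) * 0.1   # item-feature matrix

lr, reg = 0.01, 0.02
for _ in range(5000):
    err = (R - U @ V.T) * mask            # error on the known entries only
    U += lr * (err @ V - reg * U)         # adjust user features down the cost gradient
    V += lr * (err.T @ U - reg * V)       # adjust item features likewise

pred = U @ V.T                            # dot products fill in every user-item value
```

The zeros in `R` now hold predicted ratings in `pred`, and the highest predictions among a user's unrated items are the ones to recommend.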
Combination
Most systems utilize a combination of both systems in order to maximize effectiveness and reduce drawbacks
Training
train/test splits
80% of the data used to train the recommender system, 20% used to test the system
mostly supervised learning algorithms
overfitting
problem where the model fits too closely to the training dataset
the model matches the training data so exactly that it performs poorly on any other testing data
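The 80/20 split can be sketched in a few lines of Python; the shuffle seed is an arbitrary choice, used only to make the split reproducible:

```python
import random

def train_test_split(data, test_frac=0.2, seed=42):
    """Shuffle a dataset and split it: 80% for training, 20% for testing."""
    items = list(data)
    random.Random(seed).shuffle(items)   # shuffle so the split isn't ordered
    cut = int(len(items) * (1 - test_frac))
    return items[:cut], items[cut:]

ratings = list(range(100))               # stand-in for 100 rating records
train, test = train_test_split(ratings)
print(len(train), len(test))             # 80 20
```

Holding the 20% out of training is what makes overfitting visible: a model that scores well on `train` but badly on `test` has memorised rather than generalised.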
Accuracy
Mean Absolute Error
measures average magnitude of errors in a prediction
average of the absolute difference between predicted and actual values
Root Mean Square Error
square root of the average of the squared differences between prediction and actual observation
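Both error measures are one-liners; the ratings below are invented to show the arithmetic:

```python
import math

def mae(actual, predicted):
    """Mean absolute error: average magnitude of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error: square root of the average squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual    = [4, 3, 5, 2]      # hypothetical true ratings
predicted = [3.5, 3, 4, 3]    # hypothetical model predictions
print(mae(actual, predicted))   # 0.625
print(rmse(actual, predicted))  # 0.75
```

Because RMSE squares each difference before averaging, it penalises large individual errors more heavily than MAE does.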
Precision
fraction of relevant instances among retrieved instances
EG. a search engine returns 30 pages, 20 of which are relevant- precision is 20/30=2/3
tells us how valid the results are
Recall
fraction of relevant instances that were retrieved
search engine returns 30 pages, 20 of which are relevant, but fails to return 40 other relevant pages: recall is 20/60 = 1/3
tells us how complete the results are
F measure
a score that balances precision and recall
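The search-engine numbers above (30 pages returned, 20 of them relevant, 40 relevant pages missed) can be checked directly with set arithmetic; the F measure here is the common F1 form, the harmonic mean of precision and recall:

```python
def precision(retrieved, relevant):
    """Fraction of retrieved items that are relevant (validity)."""
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant items that were retrieved (completeness)."""
    return len(retrieved & relevant) / len(relevant)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Pages 0-19 are relevant hits, 20-29 are irrelevant hits, 30-69 were missed.
retrieved = set(range(30))
relevant = set(range(20)) | set(range(30, 70))
p, r = precision(retrieved, relevant), recall(retrieved, relevant)
print(p, r, f1(p, r))  # about 0.667, 0.333, 0.444
```

A system can trivially score well on one measure alone (return everything for perfect recall, or one sure hit for perfect precision), which is why the balanced F measure is reported.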
Cloud Computing
SaaS
Application that runs directly in your browser instead of on your computer
EG. Netflix, Amazon, Google docs
PaaS
Provides services that allow web apps to run
Storage
Configuration
Networking
EG. Microsoft Azure, Google App Engine
IaaS
Provides on-demand computing resources in the cloud
EG. Google compute engine, Digital Ocean
Less hand-holding than PaaS
Not focused on getting an application running- focused on individual pieces of hardware needed to run a large-scale application
https://www.youtube.com/watch?v=6PZEVNuBL0g
Social and Ethical Concerns
Behavioral Data
Explicit Data
Data gathered from what users deliberately submit
EG. rating a video clip, entering a preference, searching for an item
Implicit Data
Data the user is not aware is being collected
EG. click data, purchase data, key logging
Right to Anonymity
Keeping your identity secret
Your right to have the company/ website not share who you are with anyone else without your consent
Right to Privacy
Keeping your data private
Your right to have the company not share your data with anyone else without your consent