IB Computer Science Case Study
Machine learning
Supervised learning
Algorithm
Training set: provide the algorithm with categorised/labelled data
Testing set: feed the machine new, unlabelled data to see whether it tags the new data appropriately
Based on how well it does on the testing set, a new training set is developed and the process repeated
Suited to classification and regression problems
Unsupervised learning
Data input without labels
Data items compared to each other and then sorted into categories
Suited to pattern/structure recognition
Reinforcement Learning
Similar to supervised learning but model receives rewards in real time
Algorithm given a goal as well as a range of actions to achieve that goal
Each action is given a score based on how close it comes to achieving the goal
Pushes algorithm to try different approaches until maximum score achieved
Sort of like trial and error
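The goal/actions/score loop above can be sketched as an epsilon-greedy bandit in Python; the actions and their hidden reward probabilities are invented for illustration, and real reinforcement learning adds states and delayed rewards:

```python
import random

random.seed(1)

# Hypothetical one-state task: three actions with hidden average rewards.
true_reward = {"left": 0.2, "middle": 0.5, "right": 0.8}
value = {a: 0.0 for a in true_reward}   # learned score per action
counts = {a: 0 for a in true_reward}

for step in range(2000):
    # Trial and error: mostly exploit the best-known action, sometimes explore.
    if random.random() < 0.1:
        action = random.choice(list(true_reward))
    else:
        action = max(value, key=value.get)
    reward = random.random() < true_reward[action]   # stochastic reward signal
    counts[action] += 1
    value[action] += (reward - value[action]) / counts[action]  # running mean

print(max(value, key=value.get))  # the action with the highest learned score
```

Over many trials the running-mean scores converge towards the true rewards, so the algorithm settles on the action that best achieves the goal.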
Recommender Systems
Content-based filtering
Advantages
Works with less data
Provides results based on activities of specific user
Disadvantages
Over-specialization: keeps recommending items similar to those the user already likes
Emphasizes content features
Uses a content profile that includes content features
Oriented around product features, so it has no problems with new products
Can provide more accurate recommendations, as it focuses on features of the content a user likes
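The content-profile idea can be sketched with cosine similarity between item feature vectors and a user profile; the films, features and weights below are all hypothetical:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical content features: [action, comedy, romance]
items = {"Film A": [0.9, 0.1, 0.0],
         "Film B": [0.1, 0.8, 0.4],
         "Film C": [0.8, 0.2, 0.1]}
user_profile = [1.0, 0.2, 0.0]   # built from features of content the user liked

ranked = sorted(items, key=lambda t: cosine(items[t], user_profile), reverse=True)
print(ranked)  # items whose features best match the user's profile come first
```

Because only the item's own features are compared against the profile, a brand-new film with a feature vector can be recommended immediately, which is the new-product advantage noted above.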
Collaborative filtering
Advantages
Other users' scores are used
No deterministic result, since chance is involved in the system
Disadvantages
Needs more data
Problems with new products and users
popularity bias
Emphasizes user preferences
Requires the user profile to suggest relevant content
Feeds on user ratings, reviews, thumbs up and down, and other feedback
Products without feedback or reviews can't be recommended
New users without profile can't be given recommendations
Doesn't always ensure precise recommendations
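Using other users' scores can be sketched as user-based collaborative filtering on a toy ratings table; the users, films and similarity measure (inverse mean absolute difference, an assumption here) are illustrative, and real systems typically use Pearson correlation or cosine similarity:

```python
# Hypothetical user ratings (1-5); missing entries are unrated items.
ratings = {
    "alice": {"Film A": 5, "Film B": 1, "Film C": 4},
    "bob":   {"Film A": 4, "Film B": 1, "Film D": 5},
    "carol": {"Film B": 5, "Film C": 1, "Film D": 2},
}

def similarity(u, v):
    """Agreement on co-rated items: 1 / (1 + mean absolute rating difference)."""
    common = ratings[u].keys() & ratings[v].keys()
    if not common:
        return 0.0
    diff = sum(abs(ratings[u][i] - ratings[v][i]) for i in common) / len(common)
    return 1 / (1 + diff)

def recommend(user):
    """Suggest items the most similar other user rated that `user` hasn't seen."""
    peer = max((v for v in ratings if v != user), key=lambda v: similarity(user, v))
    unseen = ratings[peer].keys() - ratings[user].keys()
    return sorted(unseen, key=lambda i: ratings[peer][i], reverse=True)

print(recommend("alice"))  # the most similar user's top unseen items
```

The sketch also shows the cold-start problems listed above: a new user has no co-rated items (similarity 0 with everyone), and a new product with no ratings never appears in any peer's list.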
k-nearest neighbor
Stores all available cases and classifies new cases based on a similarity measure
Classifies a data point based on how its neighbours are classified
'k' refers to the number of nearest neighbours to take into consideration
Classified based on the majority classification of the k nearest neighbours
'k' chosen using parameter tuning
important for better accuracy
choosing k is often a case of trial and error
Smaller 'k' values mean each neighbour has a much higher influence on the result
noise/atypical data points can cause problems
Larger 'k' values make matching cases more computationally expensive, as more distance calculations have to be done
can also result in over-smoothing: not finding the best match for a new case
Matrix Factorisation
Matrix of users against items with their ratings for the ones they have watched
Matrix factorisation used to predict what their rating would be for ones they haven't watched and recommend them accordingly
User-item matrix split into an item-feature matrix and a user-feature matrix using an iterative algorithm, a process called decomposition
values in the item-feature and user-feature matrices are adjusted using a cost function until they reproduce the known values in the initial user-item matrix as accurately as possible
dot product of the user-feature and item-feature matrices calculates the values for a new user-item matrix with all values filled in
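The decomposition loop above can be sketched with NumPy, using plain gradient descent on a toy user-item matrix; the learning rate, regularisation, feature count and iteration count are illustrative choices, not prescribed values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy user-item ratings; 0 marks an item the user hasn't rated.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0                              # cost function only uses known ratings

n_features = 2
U = rng.random((R.shape[0], n_features)) * 0.1   # user-feature matrix
V = rng.random((R.shape[1], n_features)) * 0.1   # item-feature matrix

lr, reg = 0.01, 0.02
for _ in range(5000):
    err = (R - U @ V.T) * mask            # error on the known entries only
    U += lr * (err @ V - reg * U)         # adjust user features down the cost gradient
    V += lr * (err.T @ U - reg * V)       # adjust item features likewise

pred = U @ V.T                            # dot products fill in every user-item value
```

The zeros in `R` now hold predicted ratings in `pred`, and the highest predictions among a user's unrated items are the ones to recommend.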
Combination
Most systems utilize a combination of both systems in order to maximize effectiveness and reduce drawbacks
Training
train/test splits
80% of the data used to train the recommender system, 20% used to test the system
mostly supervised learning algorithms
overfitting
problem where the model fits too closely to the training dataset
the model matches the training data so exactly that it performs poorly on any other testing data
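The 80/20 split can be sketched in a few lines of Python; the shuffle seed is an arbitrary choice, used only to make the split reproducible:

```python
import random

def train_test_split(data, test_frac=0.2, seed=42):
    """Shuffle a dataset and split it: 80% for training, 20% for testing."""
    items = list(data)
    random.Random(seed).shuffle(items)   # shuffle so the split isn't ordered
    cut = int(len(items) * (1 - test_frac))
    return items[:cut], items[cut:]

ratings = list(range(100))               # stand-in for 100 rating records
train, test = train_test_split(ratings)
print(len(train), len(test))             # 80 20
```

Holding the 20% out of training is what makes overfitting visible: a model that scores well on `train` but badly on `test` has memorised rather than generalised.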
Accuracy
Mean Absolute Error
measures average magnitude of errors in a prediction
average of the absolute difference between predicted and actual values
Root Mean Square Error
square root of the average of the squared differences between prediction and actual observation
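Both error measures are one-liners; the ratings below are invented to show the arithmetic:

```python
import math

def mae(actual, predicted):
    """Mean absolute error: average magnitude of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error: square root of the average squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual    = [4, 3, 5, 2]      # hypothetical true ratings
predicted = [3.5, 3, 4, 3]    # hypothetical model predictions
print(mae(actual, predicted))   # 0.625
print(rmse(actual, predicted))  # 0.75
```

Because RMSE squares each difference before averaging, it penalises large individual errors more heavily than MAE does.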
Precision
fraction of relevant instances among retrieved instances
EG. a search engine returns 30 pages, 20 of which are relevant- precision is 20/30=2/3
tells us how valid the results are
Recall
fraction of relevant instances that were retrieved
search engine returns 30 pages, 20 of which are relevant, but fails to return 40 other relevant pages: recall is 20/60 = 1/3
tells us how complete the results are
F measure
a score that balances precision and recall
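The search-engine numbers above (30 pages returned, 20 of them relevant, 40 relevant pages missed) can be checked directly with set arithmetic; the F measure here is the common F1 form, the harmonic mean of precision and recall:

```python
def precision(retrieved, relevant):
    """Fraction of retrieved items that are relevant (validity)."""
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant items that were retrieved (completeness)."""
    return len(retrieved & relevant) / len(relevant)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Pages 0-19 are relevant hits, 20-29 are irrelevant hits, 30-69 were missed.
retrieved = set(range(30))
relevant = set(range(20)) | set(range(30, 70))
p, r = precision(retrieved, relevant), recall(retrieved, relevant)
print(p, r, f1(p, r))  # about 0.667, 0.333, 0.444
```

A system can trivially score well on one measure alone (return everything for perfect recall, or one sure hit for perfect precision), which is why the balanced F measure is reported.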
Cloud Computing
SaaS
Application that runs directly in your browser instead of on your computer
EG. Netflix, Amazon, Google docs
PaaS
Provides services that allow web apps to run
Storage
Configuration
Networking
EG. Microsoft Azure, Google App Engine
IaaS
Provides on-demand computing resources in the cloud
EG. Google compute engine, Digital Ocean
Less hand-holding than PaaS
Not focused on getting an application running- focused on individual pieces of hardware needed to run a large-scale application
https://www.youtube.com/watch?v=6PZEVNuBL0g
Social and Ethical Concerns
Behavioral Data
Explicit Data
Data gathered from what users deliberately submit
EG. rating a video clip, entering a preference, searching for an item
Implicit Data
Data the user is not aware is being collected
EG. click data, purchase data, key logging
Right to Anonymity
Keeping your identity secret
Your right to have the company/ website not share who you are with anyone else without your consent
Right to Privacy
Keeping your data private
Your right to have the company not share your data with anyone else without your consent