Learning dynamics
Can this project connect with learning dynamics?
Rich/lazy regimes, saddle-to-saddle dynamics, AGF. Large learning rates do not preserve the shape, while regularization drives the parameters back towards the cone [kunin2025, du2018anips] and maybe [kunin2024, wang2022].
Experiment: constrained optimization on \(\mathcal H(0)\), i.e. gradient descent on the cone.
Alternate between gradient updates and projections onto the cone to approximate the learning dynamics on the cone.
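A minimal sketch of this alternation as projected gradient descent, assuming a placeholder `project_to_cone` that maps parameters back onto \(\mathcal H(0)\) (its form depends on how the cone is parametrized; all names here are illustrative, not the project's actual code):

```python
def projected_gradient_descent(theta0, loss_grad, project_to_cone,
                               lr=1e-2, num_steps=1000):
    """Approximate the learning dynamics on the cone by alternating
    a plain gradient step with a projection back onto H(0)."""
    theta = project_to_cone(theta0)            # start on the cone
    for _ in range(num_steps):
        theta = theta - lr * loss_grad(theta)  # unconstrained gradient step
        theta = project_to_cone(theta)         # snap back onto the cone
    return theta
```

For a small enough learning rate, this scheme should track the constrained gradient flow on the cone, with a per-step error of order \(\mathrm{lr}^2\).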
Permutations need not be handled during training because they are transparent to the dynamics: gradients are permutation-equivariant, so permuting the parameters permutes the derivatives in exactly the same way.
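A quick numerical check of this claim on a toy two-layer ReLU network (the architecture and MSE loss are illustrative assumptions): permuting the hidden units permutes the gradients in the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 3, 5
W1 = rng.normal(size=(h, d))   # input weights
W2 = rng.normal(size=(1, h))   # output weights
x = rng.normal(size=(d, 1))
y = rng.normal(size=(1, 1))

def grads(W1, W2):
    """Gradients of the MSE loss 0.5 * (W2 relu(W1 x) - y)^2."""
    a = W1 @ x                 # pre-activations, (h, 1)
    r = np.maximum(a, 0.0)     # ReLU activations
    e = float(W2 @ r - y)      # scalar residual
    dW2 = e * r.T              # (1, h)
    da = e * W2.T * (a > 0)    # back-prop through the ReLU, (h, 1)
    dW1 = da @ x.T             # (h, d)
    return dW1, dW2

P = np.eye(h)[rng.permutation(h)]       # random permutation matrix
dW1, dW2 = grads(W1, W2)                # gradients at theta
dW1p, dW2p = grads(P @ W1, W2 @ P.T)    # gradients at the permuted theta

# the gradient at the permuted point is the permuted gradient
assert np.allclose(dW1p, P @ dW1)
assert np.allclose(dW2p, dW2 @ P.T)
```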
Log everything during training, in particular the rescalings used, \(\alpha_1, \dots, \alpha_T\): they are indicative of the shape distortion and therefore probably of training success (many distortions = stronger gradients for longer = noisier/more complex dataset).
Also log \(\theta_{\text{input}}\) (which is equal to \(\theta_{\text{output}}\)).
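A sketch of how this logging could hook into the projected loop above, under two assumptions not in the original notes: that `project_to_cone` also reports the rescaling \(\alpha_t\) it applied, and that a hypothetical `measure_angle` extracts \(\theta_{\text{input}}\) from the parameters.

```python
def train_with_logging(theta0, loss_grad, project_to_cone, measure_angle,
                       lr=1e-2, num_steps=1000):
    """Projected gradient descent that logs the per-step rescaling
    alpha_t and the input-side angle theta_input at every step."""
    theta, _ = project_to_cone(theta0)
    log = {"alpha": [], "theta_input": []}
    for _ in range(num_steps):
        theta = theta - lr * loss_grad(theta)    # gradient step
        theta, alpha_t = project_to_cone(theta)  # projection reports its rescaling
        log["alpha"].append(alpha_t)             # proxy for shape distortion
        log["theta_input"].append(measure_angle(theta))  # = theta_output on the cone
    return theta, log
```

Large or persistent \(\alpha_t\) in the log would then flag the "many distortions" regime noted above.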