If I resign or am laid off from Microsoft Canada in 2023, will I financially break-even-or-better for the year?
Resolved N/A (Dec 31)

This question resolves as:

  • Positive if I leave the company and make more in total compensation (stock, base, bonus) than I would have by staying.

  • Negative if I leave the company and make less than I would have by staying.

  • Ambiguous if I have not left the company by December 31st, 2023.

What counts as leaving the company:

  • Being impacted by a layoff, even if my termination date is not in 2023.

  • Quitting my job at any point, even if I "boomerang" back later.

  • Being fired for cause.

  • Joining a company owned by Microsoft (e.g. LinkedIn, GitHub).

  • Death or permanent disability (in which case insurance pays out).

What does not count as leaving the company:

  • Internal transfers to another team or business group.

  • Leaves of absence, e.g. short-term disability leave or sabbaticals.

  • International transfers to work for Microsoft in another location.

  • Interviewing around and then accepting a counter-offer to stay.

  • Being removed from an initial layoff list thanks to exec intervention.

Assumptions defining how much I would make by staying (see the toy sketch after this list):

  • Only consider employment income reported on Form T4 (Statement of Remuneration Paid).

  • On-hire RSU vesting continues as scheduled.

  • I receive at most one promotion in 2023.

  • Bonus% is somewhere between target and what I got last cycle.

  • Merit% adjustment does not take inflation rates into account.

  • I do not earn any out-of-band discretionary special stock awards.
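
To make the resolution arithmetic concrete, here's a toy sketch of the break-even comparison under the assumptions above. Every number and parameter name below is a made-up placeholder, not my actual compensation.

```python
# Toy sketch of the break-even comparison; all figures are hypothetical placeholders.

def stay_income(base, merit_pct, bonus_target_pct, last_bonus_pct, rsu_vesting):
    """Counterfactual 2023 T4 income if I stayed: merit-adjusted base,
    bonus% somewhere between target and last cycle (midpoint used here),
    plus on-hire RSUs vesting as scheduled. Assumes at most one promotion
    and no out-of-band special stock awards."""
    adjusted_base = base * (1 + merit_pct)
    bonus_pct = (bonus_target_pct + last_bonus_pct) / 2
    return adjusted_base * (1 + bonus_pct) + rsu_vesting

def resolution(actual_2023_income, counterfactual):
    """Positive if leaving left me at break-even or better for the year."""
    return "Positive" if actual_2023_income >= counterfactual else "Negative"

# Placeholder example:
counterfactual = stay_income(base=150_000, merit_pct=0.03,
                             bonus_target_pct=0.10, last_bonus_pct=0.12,
                             rsu_vesting=40_000)
print(resolution(actual_2023_income=200_000, counterfactual=counterfactual))
```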

Relevant links:

I've stayed at MS this long because I firmly believed it was the best way for me to help society, but I now want to do AI safety/interpretability/alignment full time, since I think that's higher expected value overall. If I get rejected from every organization doing cool things in AI alignment, then I'll find an awesome cofounder, incorporate a new entity, and try to get that funded.

LinkedIn for those who care: 22 y.o., SWE II, 3 years of tenure at MS (2 years FTE, 1 year as a contractor), 5 internships, multiple awards in international competitions, IEEE conference speaker, completed undergrad early, got into Stanford, active on /r/mlscaling, met Geoff Hinton once, on a closed work permit but will get a Canada PR card soon, no publications yet but co-authoring a paper to submit to NeurIPS23, good network of contacts.

predicted YES

Didn’t leave yet.

bought Ṁ3 of YES

What was meeting Hinton like? Did you get to talk with him?

predicted YES

@firstuserhere I was visiting the Vector Institute and he was giving a talk on the forward-forward algorithm, which tears out backpropagation. Stuck around after and asked him a few questions. Things that surprised me:
- He still cares about what happens in the brain, and neuroscience still informs his decisions when designing artificial neural networks.
- I asked him, "When we replace a single energy function for an entire layer with a spatially local one, how does this affect the interpretability of how the features map to neurons", but it seemed he hadn't really thought about it.

- People still care about MNIST??
- Feed-forward has two advantages over GANs: eliminating adversarial competition and eliminating mode-collapse. I wanted to know what would happen if you did RLHF on a large feed-forward net, but Hinton still has to get permission from Google to do big experiments instead of them letting him do what he wants.

predicted YES

@SheikhAbdurRaheemAli *forward-forward, not feed-forward

predicted YES

@SheikhAbdurRaheemAli Makes sense that MNIST is v popular; it's one of the first datasets anyone new to the field will play with.

Interesting that he's using neuroscience frameworks to get new ideas because that's probably been a v useful thing throughout his career.

predicted YES

@SheikhAbdurRaheemAli I'm trying to figure out what you mean by the 3rd point, on replacing a layer's energy function with a spatially local one. I get a surface-level idea of what you're trying to say; would you mind explaining a bit more?

predicted YES

@firstuserhere okay, setting a 15-minute timer, based on stalking your profile for a few seconds I'm assuming you have a technical background and are familiar with AI but aren't a full-fledged researcher.

I shall attempt to construct an inferential pathway that prioritizes brevity and simplicity over completeness, in the hopes that you'll fill in the gaps in my explanation that can be parsed from context and tell me what still feels confusing so I'm aware of the places where I skipped too many steps at the expense of clarity.

First, an optional background link: Energy based model - Wikipedia

In the context of neural networks, an energy function for a layer is a mathematical function that maps the inputs to a scalar value that represents the "energy" or "cost" of the inputs. It is commonly used in energy-based models, such as the Hopfield network or the Boltzmann machine.

Boltzmann machines were invented in 1983. They do something called contrastive learning, where you use positive and negative examples so the network doesn't just model its own wiring. They didn't really catch on, since they use Markov chain Monte Carlo sampling to get their internal states, which can get unwieldy. An interesting property is that we don't need backpropagation to get their gradients, which makes them historically relevant when considering the design space of neural networks without backprop.

The goal of the energy function is to provide a way to evaluate the compatibility between the input and the model's parameters. In other words, given a particular set of input values, the energy function calculates a score that reflects how well those inputs match the patterns that the model has learned.
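
As a rough, self-contained illustration of that idea (my own toy example, not code from Hinton's paper), here's a Hopfield-style energy function where a stored pattern gets low energy and a mismatched input gets higher energy:

```python
import numpy as np

# Toy Hopfield-style energy: E(x) = -0.5 * x^T W x.
# Lower energy = the input x is more compatible with what the weights have stored.

def energy(x, W):
    return -0.5 * x @ W @ x

# Store one pattern via a Hebbian outer product.
pattern = np.array([1.0, -1.0, 1.0, -1.0])
W = np.outer(pattern, pattern)
np.fill_diagonal(W, 0)

print(energy(pattern, W))                          # low (negative): good match
print(energy(np.array([1.0, 1.0, 1.0, 1.0]), W))   # higher: poor match
```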

Forward-forward also works by trying to discriminate between real and fake data. In contrastive learning, the attempt to discriminate between positive and negative data causes the hidden units to have different weights. But if you're doing e.g. image classification, then hidden units in very different image locations will have different weights anyway.

During training, the model adjusts its weights and biases to minimize the energy function. This process, known as energy minimization, helps the model to learn to recognize and classify patterns in the input data. Since each location learns a different pattern, it is a little silly to have a single energy function for a whole layer, so Hinton argued that we should instead have a separate energy function in each location in space. This is currently still under investigation, but you can imagine it as evaluating you based on how good you are at your job, instead of making everyone take the same test regardless of what their job is.
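
Here's a minimal sketch of that per-layer training idea, loosely following the forward-forward paper: "goodness" is the sum of squared activations, each layer is trained with its own local objective to make goodness high for positive data and low for negative data, and no gradients flow between layers. The layer sizes, threshold, and learning rate are placeholder choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    """One forward-forward layer with a purely local training objective."""

    def __init__(self, d_in, d_out, threshold=2.0, lr=0.03):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.threshold = threshold
        self.opt = torch.optim.SGD(self.parameters(), lr=lr)

    def forward(self, x):
        # Normalize the input so the next layer can't cheat by just
        # reading off the previous layer's goodness from its magnitude.
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return torch.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)   # goodness of positive data
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)   # goodness of negative data
        # Push positive goodness above the threshold and negative goodness below it.
        loss = F.softplus(torch.cat([self.threshold - g_pos,
                                     g_neg - self.threshold])).mean()
        self.opt.zero_grad()
        loss.backward()   # gradients stay inside this layer only
        self.opt.step()
        return loss.item()

layer = FFLayer(784, 256)
x_pos = torch.randn(32, 784)   # stand-in for real / correctly-labelled data
x_neg = torch.randn(32, 784)   # stand-in for corrupted / mislabelled data
print(layer.train_step(x_pos, x_neg))
```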

Okay, timer is over, so I'll start wrapping up:

A feature map is a two-dimensional array or matrix that represents the outputs of a layer of neurons after it has processed a specific input. Each element of the feature map is often referred to as a "feature," and it represents the activation of a particular neuron in the layer.

In a CNN, for example, the feature maps are obtained by applying a set of filters to the input image or other input data. Each filter is a small matrix of weights that is convolved with the input to produce a single output value. By applying multiple filters to the input, the CNN is able to learn and detect various features of the input, such as edges, corners, and textures.
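
For example, a tiny sketch of a conv layer turning one grayscale image into a stack of feature maps, one per filter (the sizes are illustrative):

```python
import torch
import torch.nn as nn

# One conv layer: 1 input channel -> 8 filters -> 8 feature maps.
# Each spatial position in a map is the activation of that filter at that location.
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
image = torch.randn(1, 1, 28, 28)        # batch of 1, MNIST-sized input
feature_maps = torch.relu(conv(image))
print(feature_maps.shape)                 # torch.Size([1, 8, 28, 28])
```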

So the question was basically: "Hey, if you think each spatially distinct image location should have a different energy function associated with it, is it now easier or harder or the same difficulty (during/after training) to poke at the model and say, 'this set of neurons lights up when it sees an edge, guess those guys are an edge-detector, and we think this other set of neurons detects this other pattern due to our analysis of the activations', and so on?"
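
To make that concrete, here's a toy sketch (my own construction, not code from the paper) of the difference between one goodness score for a whole layer and one goodness score per spatial location, which is the kind of thing you'd poke at when asking which locations light up:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)
image = torch.randn(1, 1, 28, 28)
acts = torch.relu(conv(image))                       # shape (1, 8, 28, 28)

# Single energy for the whole layer: one number summarizing everything.
layer_goodness = acts.pow(2).sum()

# Spatially local version: one goodness per (row, col) location,
# summing over channels only. Now we can ask *where* the layer responds.
local_goodness = acts.pow(2).sum(dim=1).squeeze(0)   # shape (28, 28)

print(layer_goodness.item())
top = torch.topk(local_goodness.flatten(), k=5).indices
print([(int(i // 28), int(i % 28)) for i in top])    # locations that "light up"
```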

predicted YES

@SheikhAbdurRaheemAli That's a correct assessment. I do have a technical background and have done a few machine learning projects during internships (and a short, never-published survey paper on extreme classification), but I'm not a full-fledged researcher. It's been at least a year since I had anything technical to do directly in ML, and I'm trying to switch my role from software development (specifically building software tools and statistical analysis toolkits for cybersecurity) to ML engineering/research.

These days I'm trying to upskill myself. For example, currently I'm in the middle of a 10-day solo sprint to understand mechanistic interpretability's basics and have a framework in my mind so I can attempt some low-hanging problems.

Okay, that said, let me try to understand what you have written (thanks for taking the time to do that).

predicted YES

@SheikhAbdurRaheemAli Let me first try to write my understanding of what you explained in simple terms:


So, we have an input space x and an output space y. 


Instead of trying to find the probability distribution over y given x (which can be expensive, because you're calculating a probability for each y if the output space is super large, or if a single output has a large sample space), we try to find regions in the output space that are compatible with x.


We have a function f(x, y) which represents the "energy" of the network's inputs(?), a way of narrowing the output space for a new x. It tries to find regions in the y space that are more compatible with x than others. If f represents translation, then regions in y which are closer to a translation of x will have low f(x, y). If our model has learned a good translation of some types of sentences x, then f(x, y) will be low for them.


Now, one reason I think this can be useful is if we want to have multimodal networks or a combination of different networks trained on different data, independently, but which are modeling the same problem in different dimensions. Or if we want to concatenate decision models with different types of inputs. Then stitching together such networks seems easier if we use the energy functions instead of probability distributions directly.


During the training, the model attempts to minimize the energy function and in that process learns by proxy to classify some inherent patterns in the input data. Since we are learning the compatibility between input and output space for some metric/energy function f, each layer having its own energy function would give us a distribution of not only compatible ys for x but also [something]. 


Hinton’s argument was that we should have a separate energy function in each location in space. In your analogy, to evaluate people at their job, you can either have a test of the job or you can have a test of the person. People who pass the test of a job will be ok but it’ll mean that everyone has the same test even if they’re vastly different. But if you design a test for each person, then you have to create a lot of tests/energy functions even if it’ll make the best-fit for each person faster/better.

That’s my understanding of what you wrote so far.

I don’t really grasp this line “Since each location learns a different pattern, it is a little silly to have a single energy function for a whole layer, so Hinton argued that we should instead have a separate energy function in each location in space.” Would that not be a lot of energy functions for a lot of locations? When we see a new example x, only regions in the output space that are compatible with x over f will have their activations light up instead of us calculating probabilities for each y, BUT, we’re basically front-loading the computational load, aren’t we?

predicted YES

@firstuserhere Just read this! Need to sleep soon, but I'd check out Lowe et al. (2019) from the references in [2212.13345] The Forward-Forward Algorithm: Some Preliminary Investigations.

More in-depth response if I have spare time during my reply-to-people time block tomorrow.

I think there should be a distinction between getting laid off and leaving voluntarily. From the recent Amz layoffs, two of my friends came out looking better financially than they'd have if they'd not been laid off, within 2 months lol

bought Ṁ30 of YES

@firstuserhere following convention from resolution criteria of:

I agree that the conditional probabilities may look different for layoffs and leaving, but I wanted to keep things simple, so I only made one market.