# Evaluation of multimember score methods

This is a break off from this other post

I want to dig deeper into evaluation methods. In multi-member sequential score systems we have all our basic criteria covered other than accountability and local representation. This means we are ahead of all other “PR Systems”, but there are still many such systems. There are 6 in the post, but there are many others, which come in allocation or optimal types. So how do we decide which is “best”, and what are the metrics we can evaluate under? So far in the post we have come up with the following metrics:

Total Utility = The total amount of score which each voter attributed to the winning candidates on their ballot
Total Log Utility = sum of ln(score spent) for each user [ I actually use ln(1 + x) to avoid infinities]
Total Favored Winner Utility = The total utility of each voter's most highly scored winner
Total Unsatisfied Utility = The sum of the score left to be spent by all voters, i.e. 5 - sum(score) for each user, then summed over all users
Fully Satisfied Voters = The number of voters who got candidates with a total score of 5 or more
Wasted Voters = Voters who did not get any candidate they scored

The problem is the same as the reason why I put “PR Systems” in quotation marks above. Basically there is no real way to define PR in a non-partisan sense. The standard checks are just that it fits partisan PR in the limiting case with bullet voting. I have been using the term “Ideal Representation” to refer to a fair down-sampling like what you would get from a sortition. There is no real way to calculate this. Even if we had all the voters' and candidates' positions in the ideological space, the sample size is too small to do something like a chi-squared test for independence.

So the question is, what are other metrics which would probe something like PR? The above ones have Utility sliced in many ways but I feel like we are missing the issue of spread/variation.

I was thinking of just looking at the variance of the scores for the winners. When comparing two systems, the variance of the scores would be a decent relative metric for the polarization/centrist bias. This is not really the same thing as PR, since MMP is polarizing and considered 100% PR. However, it could at least be a way to know if a system is polarizing. STV, for example, could be used as a reference point for a polarized system. Specifically, the metrics for comparing the methods could be:

1. The standard deviation of all scores for each individual winner. One metric for winner 1, then 2, then 3, etc. I would expect the standard deviation to get larger for subsequent winners.

2. The mean of the above across all winners. This would give a metric for the system as a whole since the order of the winner set is not important.

3. The standard deviation of all scores given to all winners.

This way we can at least see if the variance of the scores is too high. I am using standard deviation because it is the standard. Median absolute deviation and Average absolute deviation are also options but I am not sure if they are better. Thoughts?

The best PR metrics are their own voting methods. If you are looking for more “PR” metrics, why don’t you turn to some of the already established optimal voting methods to grade the election results?

While the harmonic and PSI quality functions (especially PSI) should yield results similar to the log function, Monroe’s quality function would not. Computing Monroe’s quality function for the special case in which there are only 2 winners is trivial: order the voters from lowest to highest value of x, where x is the utility that voter has for the first winner minus the utility she has for the second. Then compute the sum of utilities given to the second winner in the Hare quota of voters with the lowest values of x, plus the sum of utilities given to the first winner in the Hare quota of voters with the highest values of x. You can also ask Warren about polynomial-time algorithms for some other special cases of Monroe.
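The two-winner case can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `monroe_quality_two_winners` is made up, the number of voters is assumed to be even so that the Hare quota is exactly half of them, and each winner is matched with the half of the voters who lean most toward them (highest x for the first winner, lowest x for the second), which is the assignment that maximizes the total.

```python
import numpy as np

def monroe_quality_two_winners(u1, u2):
    """Monroe-style quality for exactly two winners (hypothetical helper).

    u1, u2: each voter's utility for the first and second winner.
    Assumes an even number of voters, so the Hare quota is n / 2.
    """
    u1 = np.asarray(u1, dtype=float)
    u2 = np.asarray(u2, dtype=float)
    n = len(u1)
    if n != len(u2) or n % 2 != 0:
        raise ValueError("need two equal-length arrays with an even number of voters")
    quota = n // 2

    # x > 0 means the voter leans toward the first winner.
    x = u1 - u2
    order = np.argsort(x, kind="stable")  # ascending in x

    second_group = order[:quota]  # lowest x: assigned to the second winner
    first_group = order[quota:]   # highest x: assigned to the first winner
    return float(u1[first_group].sum() + u2[second_group].sum())

# Toy ballots on a 0-5 scale: two voters lean each way.
example_quality = monroe_quality_two_winners([5, 4, 0, 1], [0, 1, 5, 3])
print(example_quality)
```

With more than two winners the assignment problem is no longer a simple sort, so this shortcut does not generalize.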

You can also measure how different methods perform under a non-monotonic approximation of the least squares quality function (use the KP and/or coin-flip transformation to convert utilities into approvals), where candidates split weights evenly among their voters and the cost (anti-quality) is the total of the squared weights of each of the voters.

That only works for 2 winners, and it is still basically a partisan PR method where we approximate people as bullet voters. It is likely better to look at the standard deviation of the total utility and max utility over all winners for each voter.

Wasn’t there also an idea that you could specify several issues, have candidates and voters with various stances on each issue, then use the overall legislature’s stance on each issue to calculate each voter’s BR?


Yeah, sure, but that requires a mapping between that space and score. The simulation has a 2D space. An issue could be a line dividing the space. If we have many such lines, the system where the majority of winners and the majority of candidates are both on the same side the most often is the best system. You may have something there. We need an approximation so we do not need to actually draw a ton of random lines.
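A brute-force version of the random-lines idea could serve as a baseline for checking any cheaper approximation. The sketch below is only one guess at how the lines might be drawn: each line passes through a randomly chosen point of the reference group at a uniformly random angle. The function name `majority_agreement`, its parameters, and the tie rule (an exact 50/50 split counts as no majority on the positive side) are all assumptions. The reference group can be the candidates, as described above, or the voters.

```python
import numpy as np

def majority_agreement(population, winners, n_lines=1000, seed=0):
    """Fraction of random dividing lines on which the majority of
    `population` and the majority of `winners` fall on the same side.

    population, winners: arrays of 2D positions, shape (n, 2).
    Hypothetical helper; the way lines are sampled is an assumption.
    """
    rng = np.random.default_rng(seed)
    population = np.asarray(population, dtype=float)
    winners = np.asarray(winners, dtype=float)

    # Each line passes through a random population point...
    anchors = population[rng.integers(0, len(population), size=n_lines)]
    # ...with a normal vector at a uniformly random angle.
    angles = rng.uniform(0.0, np.pi, size=n_lines)
    normals = np.column_stack([np.cos(angles), np.sin(angles)])
    offsets = np.einsum("ij,ij->i", anchors, normals)

    # True where a point lies on the positive side of a given line.
    pop_side = population @ normals.T > offsets
    win_side = winners @ normals.T > offsets

    pop_majority = pop_side.mean(axis=0) > 0.5
    win_majority = win_side.mean(axis=0) > 0.5
    return float((pop_majority == win_majority).mean())

# Toy 2D positions; not taken from the simulation.
example_population = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]
)
example_winners = np.array([[0.1, 0.1], [0.9, 0.8], [0.5, 0.3]])
print(majority_agreement(example_population, example_winners, n_lines=500))
```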

Are you saying you couldn’t:
randomly sample the voters with the sample size to be the number of winners elected, then have each voter mapped to their highest-utility candidate in the winner set, like the Hare Quota-elimination idea, then record how high the utility is vs. the total utility of the actual highest-utility candidate of each sampled voter among all candidates? You could run this sample multiple times to get a more robust result, or take a larger sample and use the Hare Quota idea with only a couple of voters in each “quota”.
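The sampling idea above might look something like the following. This is a sketch under assumptions, not a settled metric: `all_scores` is taken to be a pandas DataFrame of honest scores with one row per voter and one column per candidate, `winners` is a list of column names, and the comparison is reported as a ratio of sums averaged over repeated samples. The function name and the choice of a ratio (rather than a difference) are hypothetical.

```python
import numpy as np
import pandas as pd

def sampled_favorite_ratio(all_scores, winners, n_samples=200, seed=0):
    """Average, over random voter samples the size of the winner set, of
    (utility of best winner) / (utility of best candidate overall).

    Hypothetical helper. Samples whose voters scored every candidate 0
    are skipped to avoid dividing by zero.
    """
    rng = np.random.default_rng(seed)
    k = len(winners)

    # Per-voter utility of their best winner and of their best candidate.
    best_winner = all_scores[winners].max(axis=1).to_numpy()
    best_overall = all_scores.max(axis=1).to_numpy()

    ratios = []
    for _ in range(n_samples):
        idx = rng.choice(len(all_scores), size=k, replace=False)
        denom = best_overall[idx].sum()
        if denom > 0:
            ratios.append(best_winner[idx].sum() / denom)
    return float(np.mean(ratios)) if ratios else float("nan")

# Toy ballots on a 0-5 scale with candidates A, B, C.
example_scores = pd.DataFrame(
    {"A": [5, 4, 0, 1], "B": [0, 1, 5, 3], "C": [2, 5, 1, 0]}
)
print(sampled_favorite_ratio(example_scores, ["A", "B"]))
```

A value of 1.0 would mean every sampled voter found their overall favorite among the winners.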

As I understand it, a voter’s highest scored winner could be different from their actual favorite among the winner set due to some kind of strategy or score distortion.

The particular idea is Madison’s Cross-cutting cleavages. It was used to argue against partisanship since all parties will group people and cause this to fail for some people when they are assigned to parties. This implies Partisan PR is not consistent with “Madison PR”. This is an interesting thing to think about but is too computationally expensive to simulate.

Anyway, after a fair bit of thought and feedback from a number of people I have come up with a set of metrics I am happy with. There are 6 metrics which are measures on Utility and 6 which are measures on variance/polarization/equity.

I will include Python code for each, based on a pandas dataframe of scores, S, with the winners as the columns and one row for each voter.

~Utility Metrics~

Total Utility
The total amount of score which each voter attributed to the winning candidates on their ballot. Higher values are better.
S.sum(axis=1).sum()

Total Log Utility
The sum of ln(1 + score spent) for each user, where the 1 is added to avoid infinities. This is motivated by the ln of utility being thought of as more fair in a number of philosophical works. Higher values are better.
np.log1p(S.sum(axis=1)).sum()

Total Favored Winner Utility
The total utility of each voter's most highly scored winner. This may not be their true favorite if they vote strategically, but all of these metrics assume honest voting. Higher values are better.
S.max(axis=1).sum()

Total Unsatisfied Utility
The sum of the shortfall (MAX minus total score) for each user who did not get at least a total utility of MAX score. Lower values are better.
sum([1-i for i in S.sum(axis=1) if i < 1])
NOTE: The scores are normalized so MAX = 1

Fully Satisfied Voters
The number of voters who got candidates with a total score of MAX or more. In the single-winner case, getting somebody who you scored MAX would leave you satisfied. This translates to the multiwinner case if one can assume that the mapping of score to utility obeys Cauchy’s functional equation, which essentially means that it is linear. Higher values are better.
sum([(i>=1) for i in S.sum(axis=1)])

Wasted Voters
The number of voters who did not score any winners. These are voters who had no influence on the election (other than the Hare Quota) so are wasted. Lower values are better.
sum([(i==0) for i in S.sum(axis=1)])
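For convenience, the six utility one-liners above can be gathered into a single function. This is just a sketch: the function name `utility_metrics`, the dictionary keys, and the toy ballots are made up, and it assumes S has one row per voter, one column per winner, and scores normalized so MAX = 1.

```python
import numpy as np
import pandas as pd

def utility_metrics(S):
    """Collect the six utility metrics for a voters-by-winners score table.

    Assumes scores are normalized so that MAX = 1 (hypothetical helper).
    """
    total = S.sum(axis=1)  # total utility each voter gets from the winners
    return {
        "total_utility": float(total.sum()),
        "total_log_utility": float(np.log1p(total).sum()),
        "total_favored_winner_utility": float(S.max(axis=1).sum()),
        # Shortfall below MAX, summed over voters who fall short.
        "total_unsatisfied_utility": float((1 - total[total < 1]).sum()),
        "fully_satisfied_voters": int((total >= 1).sum()),
        "wasted_voters": int((total == 0).sum()),
    }

# Four voters, two winners; the third voter scored neither winner.
example_S = pd.DataFrame({"A": [1.0, 0.5, 0.0, 0.0], "B": [0.5, 0.0, 0.0, 1.0]})
example_metrics = utility_metrics(example_S)
print(example_metrics)
```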

~Variance/Polarization/Equity Metrics~

Utility Deviation
The standard deviation of the total utility for each voter. This is motivated by the desire for each voter to have a similar total utility. This could be thought of as Equity. Lower values are better.
S.sum(axis=1).std()

Score Deviation
The standard deviation of all the scores given to all winners. This is a measure of the polarization of the winners in aggregate. It is not known what a good value is for this, but it can be useful for comparisons between systems.
S.values.flatten().std()

Favored Winner Deviation
The standard deviation of each user's highest scored winner. It is somewhat of a check on what happens if Cauchy’s functional equation is not really true, that is, if the highest scored winner is a better estimate of a voter's true happiness than the total score across winners. Lower values are better.
S.max(axis=1).std()

Average Winner Polarization
The standard deviation of each winner across all voters averaged across all winners. The polarization of a winner can be thought of as how similar the scores for them are across all voters.
S.std(axis=0).mean()

Most Polarized Winner
The highest standard deviation of the winners across voters. The winner who has the highest standard deviation/polarization.
S.std(axis=0).max()

Least Polarized Winner
The lowest standard deviation of the winners across voters. The winner who has the lowest standard deviation/polarization.
S.std(axis=0).min()
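The six deviation metrics can be bundled the same way. One detail worth watching: pandas `.std()` defaults to the sample standard deviation (`ddof=1`), while NumPy's `.std()` on a flattened array defaults to the population version (`ddof=0`), so the one-liners above do not all use the same convention. The sketch below passes `ddof` explicitly so that all six agree; the function name `deviation_metrics`, the default `ddof=0`, and the toy ballots are assumptions.

```python
import numpy as np
import pandas as pd

def deviation_metrics(S, ddof=0):
    """Collect the six variance/polarization/equity metrics.

    S: voters-by-winners score table. `ddof` is applied everywhere so
    the metrics share one convention (hypothetical helper).
    """
    total = S.sum(axis=1)                   # per-voter total utility
    per_winner = S.std(axis=0, ddof=ddof)   # spread of each winner's scores
    return {
        "utility_deviation": float(total.std(ddof=ddof)),
        "score_deviation": float(np.std(S.to_numpy(), ddof=ddof)),
        "favored_winner_deviation": float(S.max(axis=1).std(ddof=ddof)),
        "average_winner_polarization": float(per_winner.mean()),
        "most_polarized_winner": float(per_winner.max()),
        "least_polarized_winner": float(per_winner.min()),
    }

# Winner A splits the voters; winner B gets the same middling score from all.
example_S2 = pd.DataFrame({"A": [1.0, 1.0, 0.0, 0.0], "B": [0.5, 0.5, 0.5, 0.5]})
example_dev = deviation_metrics(example_S2)
print(example_dev)
```

In this toy case A is the most polarized winner and B has zero polarization, which matches the intuition behind the per-winner metrics.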

What does this mean?

It’s also possible that the proportionality (number of a voter’s favorites elected) plays a role. Can that be simulated with standard deviation of utility somehow?

Overall, I think we can figure out the best PR system by balancing

And

I’d say some deviation should be tolerated to give some voters greater utility, but because it’s PR, equity becomes a far more important concern, because you could technically have some voters vote strategically (min/max) and get more utility, and then others might feel bad they were denied utility and try to emulate that strategy too. I’m not sure though, so I’m happy to look at all 6 metrics. I think a 7th metric comparing the similarity of the results of these systems against a Sequential Monroe-type result would help though.

The hare quota is calculated based on the number of voters so it can’t really be said they play no role in the election. If they do not show up then the results would be different.

I think that is somewhat covered but maybe there is a metric which is more on point for that idea.

That is just another system. There is no reason to think it is a gold standard.

Perhaps run the scores given to the winners on the ballots through an algorithm that weights the upper end of the scale more heavily, so that a voter getting a candidate they scored a 5/5 is worth more than 5x as much as getting 5 candidates who are 1/5s, and then show the first 3 metrics before and after the algorithm.

I can of course do something like that; it is sort of the opposite of taking the log of the total score. However, I am not sure which is the best function to use. We could choose anything: e^(Score), Score^2, Sin(Score), etc. Nothing is really motivated, so if we get a difference between systems it will be hard to interpret.

I was thinking about using the number of max scored winners but ultimately decided against it.

I’d recommend 1.5 x scores above the midrange value (so 6 to 10) and 1 x scores at or below the midrange value (0 to 5).
This is based off Marylander’s idea:
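A minimal sketch of the 1.5x weighting recommended above, assuming a 0 to 10 scale and a voters-by-winners pandas DataFrame of raw scores. The function name `upweight_high_scores` and its parameters are hypothetical; the midrange is taken as half of the maximum score, so on a 0 to 10 scale scores of 6 to 10 are multiplied by 1.5 and scores of 0 to 5 are left alone.

```python
import pandas as pd

def upweight_high_scores(S, max_score=10, factor=1.5):
    """Multiply scores above the midrange by `factor`; leave the rest as-is.

    Hypothetical helper: midrange is assumed to be max_score / 2.
    """
    midrange = max_score / 2
    # DataFrame.where keeps values where the condition holds and
    # substitutes the second argument everywhere else.
    return S.where(S <= midrange, S * factor)

# Three voters, two winners, raw scores on a 0-10 scale.
example_raw = pd.DataFrame({"A": [10.0, 5.0, 6.0], "B": [0.0, 4.0, 1.0]})
example_weighted = upweight_high_scores(example_raw)

# The first three utility metrics could then be compared before and after.
print(example_raw.sum(axis=1).sum(), example_weighted.sum(axis=1).sum())
```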

Let's wait and see results for the current metrics and come back to this.

Don’t you think that many people will get this confused with “wasted votes” which is something else entirely? I still think this metric would most accurately be described as “maximally unhappy voters” or something similar.

OK, fine, I'll change it.