How Much Would the Ideal Dataset Cost?
Dear JD Vance: Please Spend $500 billion on gene sequencing for every American
For a while, I’ve been frustrated with the sample sizes available in the sociobiological literature. We’re missing so much important data: GWAS, for example, is underpowered by a factor of at least 10. I groan every time some armchair analyst pulls out the GSS again, a beaten-to-death, ancient, and extremely underpowered dataset.
Social science, meanwhile, is rife with p-hacking. What if there were a way to solve all of these problems forever? I think there is — and now is the time to talk about it.
People are saying that JD Vance reads X.com and could somehow fund HBD. They are making modest proposals like “give a little more money to HBD.” But will this solve the underpoweredness crisis and the replication crisis?
I have an idea that will: measure every American.
That’s right — why worry about sampling representativeness and power when you can just measure all 300+ million Americans?
Measure what?
“Measure” is vague, of course. What are we measuring? There are some obvious necessities:
IQ, with a 1 hr+ long test
Full gene sequencing
Standard demographics (race, age, sex, income)
Big 5 personality
Performing full gene sequencing will allow us to make all of this data family level, because gene sequencing can identify relatives. However, we might also want people to report who they think their parents are, so we can identify adoptees:
Social parent identifiers
Social offspring identifiers
Now we can identify adoptees and their social parents and siblings, as well as their biological family through gene sequencing.
And as somebody who is interested in the origin of political beliefs, I have to add something which is almost always missing from large public datasets:
Political self identification
Measures of conservatism
Measures of economic orientation / authoritarianism
This is everything I am interested in at this point in time. Of course, if you were to do this, you should probably get medical histories as well. Ideally, everyone who uses these kinds of datasets would be able to make their own suggestions on what to measure.
How much would this cost?
Whole genome sequencing costs $600/person right now. The phenotypic measurements can easily be done in less than one 8-hour day.
If there are 330 million Americans, the gene sequencing will cost $198 billion. If we make the Day of Measurement paid as well as a typical stimulus check, that would be $1000/person which would be an additional $330 billion.
Consequently, an upper bound for the cost of the Final Dataset is $528 billion.
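As a rough check on the arithmetic, here is a minimal sketch that recomputes the upper bound from the figures quoted above ($600 per genome, $1000 per participant, 330 million people). The function name and the `participation` parameter are illustrative assumptions, not part of the proposal; the parameter only scales the headcount so partial uptake can be costed the same way.

```python
# Figures taken from the text; all names here are illustrative.
POPULATION = 330_000_000        # "330 million Americans"
SEQUENCING_COST_USD = 600       # whole genome sequencing, per person
PAYMENT_USD = 1_000             # Day of Measurement payment, per person


def final_dataset_cost(participation: float = 1.0) -> dict:
    """Return cost components in USD for a participation rate in [0, 1].

    `participation` is a hypothetical knob: 1.0 means everyone is measured.
    """
    if not 0.0 <= participation <= 1.0:
        raise ValueError("participation must be between 0 and 1")
    people = round(POPULATION * participation)
    sequencing = people * SEQUENCING_COST_USD
    payments = people * PAYMENT_USD
    return {
        "people": people,
        "sequencing": sequencing,
        "payments": payments,
        "total": sequencing + payments,
    }


full = final_dataset_cost()
print(f"sequencing: ${full['sequencing'] / 1e9:.0f} billion")
print(f"payments:   ${full['payments'] / 1e9:.0f} billion")
print(f"total:      ${full['total'] / 1e9:.0f} billion")
```

With full participation this reproduces the $198 billion, $330 billion, and $528 billion figures. The per-person cost is $1600, so the total scales linearly with however many people actually show up.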
Every year, the US wastes at least half that on funding high school. This means the Final Dataset, which would be amazing if collected just once, would only cost two years’ worth of the returns to abolishing high school. After those two years, we could move on to things like free embryo selection and “free” nuclear power plants.
The Final Dataset is actually very cheap compared to government spending. Why should we data people want for anything ever again when this is the case?
What about human rights?
Many people would obviously oppose this proposal, but that doesn’t mean it’s wrong. I’m betting with the right messaging, most people would cooperate voluntarily with data collection. Even 10% of people would be a huge win. Some preliminary surveys could tell us how successful the following fully voluntary plan would be:
Coordinate using Trump’s Big Tech backers to advertise the Final Dataset, attempting to make it high Status©
Auction the stimulus money monthly. It can take a while to collect the data. Start by paying people $500. In theory, people with high time preference and little money will come in first, cheaply, for quick cash. By the end, pay several thousand dollars per session to lure in professionals.
Alternatively, offer more money based on demonstrable opportunity cost. Implement this as a surprise after getting as many people as possible for $1000.
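The rising monthly offer can be sketched as a simple price schedule. This is only an illustration under loud assumptions: the text gives a $500 starting point and "several thousand dollars" at the end, but the $3000 end price, the six-month horizon, and the linear ramp are all hypothetical choices made here, as is the function name.

```python
from typing import List


def monthly_offers(start_usd: float, end_usd: float, months: int) -> List[float]:
    """Per-session offer for each month, rising linearly from start to end.

    A linear ramp is an assumption; any increasing schedule would fit the idea.
    """
    if months < 2:
        raise ValueError("need at least two months to form a schedule")
    step = (end_usd - start_usd) / (months - 1)
    return [round(start_usd + step * m, 2) for m in range(months)]


# Hypothetical parameters: $500 comes from the text, the rest is assumed.
offers = monthly_offers(start_usd=500, end_usd=3000, months=6)
for month, offer in enumerate(offers, start=1):
    print(f"month {month}: ${offer:,.0f} per session")
```

People who value quick cash accept the early, low offers, while those with a higher opportunity cost wait for later months. The real schedule would have to be tuned against the preliminary surveys mentioned above.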
However, we don’t have to only collect data voluntarily. America is not a voluntary country. Let’s consider some of the involuntary processes Americans will be familiar with:
The school system: high school is somewhere between 4 years of house arrest and a low security imprisonment for the crime of being in your teens. It is almost completely involuntary. University is pseudo-voluntary and is like a massive fine combined with 4 years of community service for the crime of being 18 to 22. You can opt out of the community service, but you usually end up paying a bigger fine throughout your life to compensate.
The prison system: there are plenty of NAP-violating laws that put people in prison for various things with no identifiable victim.
The driving system: America is car country, and cars are regulated about how you would expect them to be in the Soviet Union, if it had been a car country.
Now, of course, there are arguments for all these, but none of those arguments are consistent with a principle of absolute voluntaryism. They all boil down to decency and second order effects on society. I am actually not against these arguments on principle, but they tend to be made up by laymen with no actual empirical basis. For example, if school really did everything laymen claim it does, the system would probably be better than a more libertarian one. However, the layman claims about schooling don’t survive empirical auditing. It’s probably the same for most driving regulations and public decency laws.
Funny enough, the Final Dataset would provide enough power to definitively test a lot of these ideas. If we accept that violating voluntaryism is necessary in some cases, but is wrong when it’s not necessary, then it’s necessary to know what the cases are where it should be violated. The Final Dataset is the way to find this out, so it’s necessary to violate voluntaryism to collect it.
Nobody, then, who is not a strict libertarian, can object to the collection of the Final Dataset on consistent grounds. They can only do so on vibes, which they would certainly do. Because, as we all hopefully know at this point, the layman does not engage in verbal ethical reasoning like this; he just follows vibes, which are mostly genetic instincts. These instincts, when really scrutinized, are probably what most would consider to be “bad” or “evil” when consciously assessing other people, and thus they must be hidden under weak assertions of untrue empirical propositions. But they are so common that they dominate our law.
Likewise, these evil instincts do not want to be exposed to the light, and for that reason the collection of the Final Dataset would be opposed by the majority of people. And there is no arguing with instincts, at the end of the day.
This is why more realistic proposals will ultimately have to be followed, but what is life without dreaming?
What about starting with mandating that convicted felons do so?
Most states require convicted felons to give DNA samples. As far as I know, this is exclusively used for identification of DNA samples from crime scenes, but why not use the data to identify gene clusters that predispose persons to commit violent crime?
This overcomes civil liberties objections and gives us a very alerted sample size of people with the worst DNA. I guarantee that we will learn a great deal of actionable knowledge.
More carrots, fewer sticks. You need to give people some investment in the project beyond just a one-off payment, e.g. subsidized access to health services developed using the database; ownership of a token airdropped to participants in it; etc. People will also have privacy concerns, so you'll need to involve the zk specialists in its development.