Will a Non-Crappy 3D Objects Version of Dall-e be Published Before June 2024?

This market is inspired by a market originally by @VictorLevoso; Will a non-crappy video equivalent of dall-e be published before June 2023?

So first off, buyer beware as this market, much like the inspiration market, may be up to interpretation and does not abide by best practices of it being something mapable by a third-party, quantitative metric with a threshold, but rather by the market maker's interpretation of evidence provided in the comments.

Recently, a paper came out from OpenAI researches called Shap-E, which is a 3D object generator. From the paper, Shap-E is an improvement over a previous model called Point-E. Whereas Point-E modeled point clouds, Shap-E uses something called Neural Radiance Fields (NeRF) which represents a scene as an implicit function.

An implicit function is a mathematical function that involves variables implicitly, as opposed to explicitly having a relationship, as is the case in an explicit function. Whereas in this function:

y = x^2

y is explicitly a function of x. Whereas in the function:

x^2 + y^2 = 1

x and y have an implicit relationship.

Point-E used an explicit generative model over point clouds to generate 3D assets, unlike Shap-E which directly generates the parameters of implicit functions. So in rendering a 3D object, Point-E uses a Gaussian Diffusion process was used to mimic how a light source illuminates a scene.

Noising Process: q(xt∣xt−1):=N(xt; sqrt(1−βtxt−1),βtI)

the state xt at time t is generated from the state xt−1 at the previous time step. The state xt is a noisy version of xt−1, where the noise is Gaussian with mean sqrt(1−βtxt−1) and covariance βtI. The parameter βt determines the amount of noise added at each time step.

Where:

xt=sqrt(αˉt)x0 + sqrt(1−αˉt)ϵ, where ϵ is a Gaussian random variable which explicitly defines xt, and αˉt is a noise parameter βt up to time t.

So what they did is reversed the above Noising Process, q(xt−1∣xt), using a neural network, pθ(xt−1∣xt), setting the goal of learning the paremeters θ of the neural network so that it can accurately reverse the noising process. So as a result, Point-E was able to generate images which are very detailed like the following:

Contrast this to Shap-E, which implicitly maps an input pair with :

FΘ : (x,d) ↦ (c,σ)

The given equation, Neural Radiance Field (NeRF), defines a function FΘ that maps an input pair (x,d) to an output pair (c,σ).

x is a 3D spatial coordinate. This could represent a point in a 3D scene or a 3D model.
d is a 3D viewing direction. This could represent the direction from which x is being viewed, such as the direction from a camera to x.
c is an RGB color. This is the color that the function computes for the point x when viewed from the direction d.
σ is a non-negative density value. This could represent the density of a volume at the point x, which could affect how light interacts with the point and thus influence the computed color.

The function FΘ takes a spatial coordinate and a viewing direction as input and computes a color and a density as output. The exact way in which it computes the color and density would be determined by the parameters Θ. So with that mathematical background understood, Shap-E was trained in two stages:

An encoder was trained to map a bunch of 3D assets into the parameters of this above implicit function (as well as another one which I won't go into here) which takes viewing direction as input and color and density as output.
A Conditional Diffusion Model was trained on the outputs of the encoder. A Conditional Diffusion Model is like a Guassian Diffusion Model used in PointE above, but conditioned on text-based descriptions of the original training images.

Coherency was evidently given priority over diversity of samples.

Shap-E was able to generate images which were, "pleasing" and did not skip out on parts of the model as opposed to Point-E, like the following:

I was able to render an image of a dog with a HuggingFace demo:

and I was able to successfully convert this into a GLTF file:

So here's the tricky part. While this does seem to be an interesting approach, the value of Dall-E and other generative AI seems to be in part the capability to create, "whatever," but you don't seem to be able to do that with Shap-E, it creates all sorts of mistakes, e.g.:.

The resulting samples also seem to look rough or lack fine details.
Further, this is not covered in the paper, but the architecture seems to be super resource heavy as far as I can tell, I tried to run it in a Colab notebook and it took forever, so this might not be, "cheap."

MARKET RESOLUTION

We will use the resolution criteria outlined in the above sourced Dall-e for video market ending at the end of May 2023 as a precedent.

Note - this resolved NO.

Here was the reasoning given:

So extrapolating that out and rather than saying, "object permanance," we might say, "object congruence." E.g. if models of a wide variety of non-cherry picked random topics have significant parts cut-off or surfaces that are not congruent, this would resolve as NO.
We can't use cherrypicked promo images.
If a sufficient quantity of generated models is poor, this will resolve as NO. We won't do a 50% resolution, unless it's clearly and unequivocably 80% of models being generated congruently.
Random sampling of objects to be generated must be sufficiently large.

#	Name	Total profit
1		Ṁ164
2		Ṁ152
3		Ṁ88
4		Ṁ87
5		Ṁ67

🏅 Top traders

People are also trading

Related questions