Part II
Who are we?

Well, that's cool. I literally realized as I copied the email that I addressed my first email to this audience as "Self-pupped authors." Woof. Can you say tired?

I promise, I'm not that lax in my analyses. I do know how to check my mistakes. Just not at midnight, apparently, after I've been looking at SPSS software all day long. If you missed that email, you can view it here.


I wanted to include a few screenshots of key descriptives—these are the stats we use to get a feel for the range of a numeric measure. I look at the histograms too, but they can be a little unwieldy when everyone is reporting slightly different numbers. And descriptives are something I use to create different versions of variables called z-scores, which I'll explain in a moment. 

All right. That right there is a chart with the "descriptives" of some of the main numeric variables we recorded. I'm going to talk about what I notice about each one. 

Up top, we have a few measures of platform size: newsletter size (NLSubscribers), Facebook page size, Facebook group size, and Facebook friend count. Yes, I could have asked for others (Twitter, Instagram, etc.). If there are signs that a particular social media account really affects sales, that might be a question worth tracking down the road. But current logic says that the majority of interaction and advertising happens on Facebook right now (with a good chunk going to AMS and BookBub). And it's where many of us build our primary platforms. So I started there.

The N-value is the number of people who answered that question. Notice how it isn't the same for every one? That's because not everyone answered every question. That's typical, but it's also why I wanted a sample large enough to absorb the missing responses. Our listwise valid N across all of these variables is 120, after every participant with missing data in ANY of those measures is "deleted." This means if I want to keep my prediction model solid, I'll have to keep the variables down to about 9 or 10 max. Hopefully it doesn't come to that. But this is why it's good to answer ALL the questions in a survey. You might not think there's a connection, but the researcher probably does. If you don't answer, your data is not included in that assessment.

Minimum and maximum are pretty self-explanatory (if they aren't, let me know). Mean is another word for average—all of the reported values in the sample added up and divided by the number of them. It's telling, for instance, that the mean for Revenue is $80k. Even without looking at a histogram, that tells me there are a lot more low earners than high earners—otherwise it would be closer to $1.25 million, halfway between the minimum and maximum. It also tells me that the high earners are outliers. As much as we all want to be like them, they might skew our results. But we'll keep them in because we want to know their patterns.
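If you want to see that outlier effect in action, here's a tiny sketch in Python (I work in SPSS, but Python is easier to share in an email) using completely made-up revenue numbers, not our survey data:

```python
# Completely made-up revenue list: six modest earners and one outlier
revenues = [5_000, 8_000, 12_000, 20_000, 30_000, 45_000, 2_500_000]

mean = sum(revenues) / len(revenues)
median = sorted(revenues)[len(revenues) // 2]  # middle value of 7 numbers

print(round(mean))   # the single big earner drags the mean way up
print(median)        # the median stays with the typical author
```

One outlier is enough to pull the mean to several times the "typical" value, which is exactly why a mean of $80k against a $2.5 million maximum screams "mostly low earners."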

The last thing to really understand here is that column for standard deviation. This is a hard one to explain without a chalkboard, but I'll do my best with the graphic I stole from Wikipedia.

Remember those bell curves from the last email? Well, math tells us that you can split each side of that bell curve into three equal parts called standard deviations (that little Greek circle shape, σ). The population that makes up a normal distribution (bell curve) hangs out in those parts in predictable proportions. Approximately 68% of a given population sits within the first part—the first standard deviation—on either side of the mean (the average, represented as zero in this graphic). Another 27-ish percent sits in the second part after that—the second standard deviation. The last roughly 5% lands in the third standard deviation or beyond. Anything outside those first two standard deviations is considered really outside the norm. Those of you with kids—this is usually how gifted and talented testing works. In the Seattle schools, for instance, kids are not allowed into those classes unless they test into the third standard deviation, or the "98th percentile." Just a very real-world application of standard deviations (also, my kid is starting kindergarten this year, so I'm thinking a lot about this stuff).
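For the skeptics, you can check those percentages yourself by simulating a normal population. A quick sketch in Python (simulated numbers only, nothing to do with our survey):

```python
import random

random.seed(42)
# Simulate a big normal population: mean 0, standard deviation 1
# (illustrative numbers only, not the survey data)
population = [random.gauss(0, 1) for _ in range(100_000)]

within_1sd = sum(abs(x) <= 1 for x in population) / len(population)
within_2sd = sum(abs(x) <= 2 for x in population) / len(population)

print(round(within_1sd * 100))  # roughly 68 percent within one SD
print(round(within_2sd * 100))  # roughly 95 percent within two SDs
```

That 68/95 split is baked into the math of the normal curve, which is why statisticians lean on it so heavily.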

Returning to our figures, if you look back at the descriptives, the standard deviations tell us other things besides the gifted and talented potential of our metrics. For ad spend, for instance, while the average monthly ad spend for our sample is $2,506, about 68% of the sample (if the distribution is normal) spends within $6,817 of that, above or below. Standard deviations also offer another way of seeing that this particular metric has a pretty massive range. Although this is a normal distribution (one-sided, but mostly normal), that range, like revenue, is HUGE. On the histogram (this one is computed as yearly instead of monthly), there are more evident outliers and large gaps in spending. Check it out:

This suggests, like the income pie chart from the last email, that there are pretty different patterns happening in different groups. BUT, it's also pretty consistent with the way the revenue histogram pans out. See?

And you'll notice that when I create a new variable to get a picture of profit, that histogram becomes pretty nice and normal. Still, the range remains pretty massive, with a giant chunk of the population clustered right around the center. This isn't actually surprising to me. Check out the mean: for the Net Revenue histogram, it cuts in half. Considering how many ad consultants out there preach a 2:1 return on investment for advertising, it doesn't surprise me AT ALL that ad spend would essentially cut revenue in half. It also tells me that although maybe a few people reported their profit, not revenue, the majority did actually provide their gross. Which makes it a fairly accurate representation.
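The 2:1 arithmetic, spelled out with a hypothetical round number:

```python
# A 2:1 return means every $1 of ad spend brings back $2 of gross revenue.
# Hypothetical numbers, just to show why net ends up at half of gross.
gross = 100_000
spend = gross / 2   # implied by a 2:1 return: spend is half of what it brings in
net = gross - spend

print(net)  # 50000.0
```

So if most of the sample is advertising at roughly that ratio, a net revenue mean at half the gross revenue mean is exactly what you'd expect.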

What I find more alarming is that the mean net revenue ends up hanging out right around zero. Not surprising, but alarming. Given how wide the range is on this (the standard deviation is still over two hundred thousand), the people on the edges of this distribution are doing something REALLY intense (positively or negatively) with their ad spends, and it impacts the larger representation. We will see what that is as we start correlations.

One last thing about standard deviations: we like them because they make it possible to put variables together that weren't measured in the same units. As long as the variables are numerical, we can do it.

I'll give you an example:

This is an early correlation matrix I did on the data to look for potential overlap between measures. I'll talk more about correlation in another email, but here's what you need to know right now: those numbers in the red box are way too high to be included in a model together. There are others that are too high too, but these came up again and again. What they tell me is that those variables are potentially measuring the same thing, or something very similar. Revenue, Newsletter Subscribers, and FB page followers "correlate" strongly enough that if I included them all in the same predictive model, they would probably wash each other out. OR, if one were the main outcome (the thing we want to predict), the model would look stronger than it really is. I don't want that. It's the statistical equivalent of my husband telling me I don't look tired after I've been writing all night for three days straight. It might make me feel good when it's just us, but if I'm going to a party and potentially having my photo taken, I'd probably prefer honesty. Later on, when I see the suitcases under my eyes plastered on Instagram, I'll know it was a lie and wish he'd just handed me my concealer instead.
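If you're curious what "correlate" means mechanically, here's a from-scratch Pearson correlation in Python (the real matrix came out of SPSS), run on made-up mini-data with hypothetical variable names:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up mini dataset: two measures that move together...
subscribers = [500, 1_200, 3_000, 8_000, 20_000]
revenue     = [2_000, 5_000, 11_000, 30_000, 85_000]
# ...and one that shouldn't be related to either
shoe_size   = [7, 10, 6, 9, 8]

print(round(pearson_r(subscribers, revenue), 2))   # close to 1: likely redundant
print(round(pearson_r(subscribers, shoe_size), 2)) # near 0: independent information
```

Correlations near 1 (or -1) are the red-box situation: two columns carrying nearly the same information, which is exactly what you don't want stacked inside one model.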

Did I lose you there? Hopefully not.

Back to the lying husbands on the correlation table. Instead of keeping all of them, what I'll probably do is create a "composite" variable out of those. But since they are all measured in different units—two in people, the other in dollars—I have to convert them all first to standard deviation scores (we call them "z-scores" in stats), and then create a variable out of their average. 
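Here's roughly what that standardize-then-average step looks like, sketched in Python with made-up numbers (SPSS does the same thing under the hood when you save z-scores and compute a mean):

```python
import statistics

def z_scores(values):
    """Standardize: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Made-up measures in different units: people, people, dollars
nl_subs = [500, 1_200, 3_000, 8_000, 20_000]
fb_page = [300, 900, 2_500, 7_000, 15_000]
revenue = [2_000, 5_000, 11_000, 30_000, 85_000]

# Standardize each measure, then average across measures per author
z = [z_scores(m) for m in (nl_subs, fb_page, revenue)]
composite = [sum(vals) / len(vals) for vals in zip(*z)]

print([round(c, 2) for c in composite])  # one composite score per author
```

Because every z-scored variable has a mean of zero and a standard deviation of one, people-counts and dollar amounts end up on the same scale and can be averaged without the dollars swamping everything else.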

So, I did it, and I called it "Author Power"—the combination of platform reach and revenue. When I tested the reliability of this metric, it came out VERY high (for those of you familiar, it had a Cronbach's alpha = 0.903). That's really, really good, and makes it potentially a better measure for success than any of those measures alone. I can't say for sure if that's the best metric, since maybe we want to know about revenue specifically, but it's one we'll definitely try.
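For the stats-curious, Cronbach's alpha itself is a pretty simple formula: it only needs the number of items, each item's variance, and the variance of the summed scores. A sketch in Python with made-up item scores (not the real Author Power inputs):

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha for a list of items, where each item is a list of
    scores (one score per respondent, already on comparable scales)."""
    k = len(items)
    sum_item_vars = sum(statistics.variance(item) for item in items)
    totals = [sum(scores) for scores in zip(*items)]   # each respondent's total
    return (k / (k - 1)) * (1 - sum_item_vars / statistics.variance(totals))

# Three made-up measures that move almost in lockstep, like z-scored
# subscribers, followers, and revenue for five hypothetical authors
item_a = [1.0, 2.0, 3.0, 4.0, 5.0]
item_b = [1.1, 2.0, 2.9, 4.2, 5.1]
item_c = [0.9, 2.1, 3.2, 3.8, 5.0]

alpha = cronbach_alpha([item_a, item_b, item_c])
print(round(alpha, 2))  # very close to 1 for near-duplicate items
```

When the items rise and fall together, the variance of the totals dwarfs the sum of the individual variances and alpha heads toward 1, which is why an alpha of 0.903 says these three measures hang together as one construct.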

Okay, that's probably it for this week until I get my words in. I love playing with data, but it distracted me this weekend to the point where I'm now really behind on my word count. My beta readers are not very happy with me right now. 

Let me know if you have any questions about any of that stuff or additional thoughts. I'll try to be less lecture-y next time when I talk about correlations.



Author Analytics with Nicole French

You received this email because you signed up on our website or made a purchase from us.