goji berries

Peer to Peer

20 Mar

March 20, 2017April 12, 2017 editor

Peers are equals, except as reviewers, when they are more like capricious dictators. (Or when they are members of a peerage.)

We review our peers’ work because we know that we are all fallible. And because we know that the single best way we can overcome our own limitations is by relying on well-motivated, informed, others. We review to catch what our peers may have missed, to flag important methodological issues, to provide suggestions for clarifying and improving the presentation of results, among other such things. But given a disappointingly long history of capricious reviews, authors need assurance. So consider including in the next review a version of the following note:

Reviewers are fallible too. So this review doesn’t come with the implied contract to follow all ill-advised things or suffer. If you disagree with something, I would appreciate a small note. But rejecting a bad proposal is as important as accepting a good one.

Fear no capriciousness. And I wish you well.

(Software) Product Development Cycle

6 Mar

March 6, 2017November 11, 2017 editor

Solicit Ideas
Talk to customers, analyze usage data, talk to peers, managers, think, hold competitions, raffles,…
For each idea:
Define the idea
Write out the idea for clarity — at least 5–10 sentences. Do some due diligence to see what else is there. Learn and revise (including abandon).
Does the idea make business sense? Does it make sense to the developers (if the idea is mechanistic or implementation related)?
Ideally, peer review. But at the minimum: Talk about your idea with the manager(s) and the developers. If both manager(s) and developer(s) agree (for some mechanistic things, only developers need to agree), move to the next step.
Prototype
This is optional, and for major, complex innovations that can be easily prototyped only. Write code. Produce Results. Does it make sense? Peer review, if needed. If not, abandon. If it does, move to the next step
Write the specifications
Spend no less than 20% of the entire development time on writing the specs, including proposed functions, options, unit tests, concerns, implications. Run the specifications by developers, get this peer reviewed, improve, and finalize.
Set Priority and Release Target
Talk to the manager about the priority order of the change, and assign it to a particular release cycle.
Who Does What by When?
Create JIRA ticket(s)
Code
General cycle = code, test -> peer review -> code, test -> peer review …
MVP of Expected Code or Aspects of good code: ‘Code your documentation’ (well-documented), modular, organized, tested, nice style, profiled. Do it once, do it well.

The Innumerate American

19 Feb

February 19, 2017August 27, 2022 editor

In answering a question, scientists sometimes collect data that answers a different, sometimes yet more important question. And when that happens, scientists sometimes overlook the easter egg. This recently happened to me, or so I think.

Kabir and I recently investigated the extent to which estimates of motivated factual learning are biased (see here). As part of our investigation, we measured numeracy. We asked American adults to answer five very simple questions (the items were taken from Weller et al. 2002):

If we roll a fair, six-sided die 1,000 times, on average, how many times would the die come up as an even number? — 500
There is a 1% chance of winning a $10 prize in the Megabucks Lottery. On average, how many people would win the $10 prize if 1,000 people each bought a single ticket? — 10
If the chance of getting a disease is 20 out of 100, this would be the same as having a % chance of getting the disease. — 20
If there is a 10% chance of winning a concert ticket, how many people out of 1,000 would be expected to win the ticket? — 100
In the PCH Sweepstakes, the chances of winning a car are 1 in 1,000. What percent of PCH Sweepstakes tickets win a car? — .1%

The average score was about 57%, and the standard deviation was about 30%. Nearly 80% (!) of the people couldn’t answer that 1 in a 1000 chance is .1% (see below). Nearly 38% couldn’t answer that a fair die would turn up, on average, an even number 500 times every 1000 rolls. 36% couldn’t calculate how many people out of 1,000 would win if each had a 1% chance. And 34% couldn’t answer that 20 out of 100 means 20%.

If people have trouble answering these questions, it is likely that they struggle to grasp some of the numbers behind how the budget is allocated or for that matter, how to craft their own family’s budget. The low scores also amply illustrate that the education system fails Americans.

Given the importance of numeracy in a wide variety of domains, it is vital that we pay greater attention to improving it. The problem is also tractable — with the advent of good self-learning tools, it is possible to intervene at scale. Solving it is also liable to be good business. Given numeracy is liable to improve people’s capacity to count calories, and make better financial decisions, among other things, health insurance companies could lower premiums in lieu of people becoming more numerate, and lending companies could lower interest rates in exchange for increases in numeracy.

Motivated Citations

13 Jan

January 13, 2017May 3, 2018 editor

The best kind of insight is the ‘duh’ insight—catching something that is exceedingly common, almost routine, but something that no one talks about. I believe this is one such insight.

The standards for citing congenial research (that supports the hypothesis of choice) are considerably lower than the standards for citing uncongenial research. It is an important kind of academic corruption. And it means that the prospects of teleological progress toward truth in science, as currently practiced, are bleak. An alternate ecosystem that provides objective ratings for each piece of research is likely to be more successful. (As opposed to the ‘echo-system’—here are the people who find stuff that ‘agrees’ with what I find—in place today.)

An empirical implication of the point is that the average ranking of journals in which congenial research that is cited is published is likely to be lower than in which uncongenial research is published. Though, for many of the ‘conflicts’ in science, all sides of the conflict will have top-tier publications—-which is to say that the measure is somewhat crude.

The deeper point is that readers generally do not judge the quality of the work cited for support of specific arguments, taking many of the arguments at face value. This, in turn, means that the role of journal rankings is somewhat limited. Or more provocatively, to improve science, we need to make sure that even research published in low ranked journals is of sufficient quality.

(Numb)ers

7 Jan

January 7, 2017August 29, 2021 editor

Limited Time Only

Lifespan ~ 80 years*52 weeks/year ~ 4k weeks

If you are 40: 2k weeks to go.

If you are 70: another 500 weeks.

Well-read

If you read a book a week for 50 years, you will read ~ 2500 books.

Approximately 130M books have been published.

Poor

In 2013, India’s PPP adjusted median per capita income was ~$1,500. So ~ 400M people < $1500/yr.

Since it is PPP wage, think about living on $1500/year in the US
Another way to think about it:
The wage of median American working for a year (~ $27,000) > wage of 18 Indians working for a year
Yet another way to think about it:
Avg. career ~ 35 years (30–65)
What median American will earn in 2 yrs = lifetime earning of an Indian earning the median income

The Guilded Age

“How many hours are needed to qualify to take the State Board examinations?
Cosmetologist = 1600 hours, Barber = 1500 hours, Esthetician = 600 hours, Electrologist = 600 hours, Manicurist = 400 hours.” From https://www.barbercosmo.ca.gov/forms_pubs/publications/faqs.shtml

“California EMT programs are at least 160 hours and include at least 136 hours of didactic training and at least 24 hours of clinical training.”

https://www.healthcarepathway.com/become-an-emt/california-emt/

Chicken

There are ~ 25B chickens in the world. More here.

Earth

US from coast to coast is ~ 3k miles. The earth’s diameter is ~ 8k miles. Earth is puny. US is large.

Rasmussen vs. the Rest

12 Oct

October 12, 2016October 12, 2016 editor

For the data and scripts used to generate the graphs, see https://github.com/soodoku/pollbias.

Town Level Data on Cable Operators and Cable Channels

12 Sep

September 12, 2016November 11, 2017 editor

I am pleased to announce the release of TV and Cable Factbook Data (1997–2002; 1998 coverage is modest). Use of the data is restricted to research purposes.

Background

In 2007, Stefano DellaVigna and Ethan Kaplan published a paper that used data from Warren’s Factbook to identify the effect of the introduction of Fox News Channel on Republican vote share (link to paper). Since then, a variety of papers exploiting the same data and identification scheme have been published (see, for instance, Hopkins and Ladd, Clinton and Enamorado, etc.)

In 2012, I embarked on a similar such project—trying to use the data to study the impact of the introduction of Fox News Channel on attitudes and behaviors related to climate change. However, I found the original data to be limited—DellaVigna and Kaplan had used a team of research assistants to manually code a small number of variables for a few years. So I worked on extending the data. I planned on extending the data in two ways: adding more years, and adding ‘all’ the data for each year. To that end, I developed custom software. The data collection and parsing of a few thousand densely packed, inconsistently formatted, pages (see below) to a usable CSV (see below) finished sometime early in 2014. (To make it easier to create a crosswalk with other geographical units, I merged the data with Town lat/long (centroid) and elevation data from http://www.fallingrain.com/world/US/.)

Sample Page

Snapshot of the Final CSV

Soon after I finished the data collection, however, I became aware of a paper by Martin and Yurukoglu. They found some inconsistencies between the Nielsen data and the Factbook data (see Appendix C1 of paper), tracing the inconsistencies to delays in updating the Factbook data—“Updating is especially poor around [DellaVigna and Kaplan] sample year. Between 1999 and 2000, only 22% of observations were updated. Between 1998 and 1999, only 37% of observations were updated.” Based on their paper, I abandoned the plan to use the data, though I still believe the data can be used for a variety of important research projects, including estimating the impact of the introduction of Fox News. Based on that belief, I am releasing the data.

Teaching Social Science

12 Sep

September 12, 2016November 17, 2017 editor

Three goals: impart information, spur deeper thinking about the topic and the social world more generally, and inculcate care in thinking. As is perhaps clear, working toward achieving any one of these goals creates positive externalities that help achieve other goals. For instance, care in exposition, which is a necessary though insufficient condition for imparting correct information, is liable to produce, either through mimesis or further thought, care in how students think about questions.

Supplement such synergies by actively seeking and utilizing pertinent opportunities during both, class-wide discussions about the materials, and one-to-one discussions about research projects, to raise (and clarify) relevant points. During discussions, encourage students to seriously consider questions about epistemology, fundamental to science but also more generally to reasoning and discourse, by weaving in questions such as, “What is the claim that we are making?”, and “When can we make this claim and why?”.

Some of the epistemological questions are most naturally (and perhaps best) handled when students are engaged in working on their own research projects. Guiding students as they collect and analyze their own data provides unique opportunities to discuss issues related to research design, and logic. And it is my hunch that students are more engaged with the material (and hence learn more of it, and think more about it) when they work on their own projects than when asked to learn the materials through lectures alone. For instance, undergraduates at Stanford often excel at knowing the points made in the text, but often have yet to spend time thinking about the topic itself. My sense is (and some experience corroborates it) that thinking broadly about an issue allows students to gain new insights, and helps them contextualize their findings better. It also spurs curiosity about the social world and the broader set of questions about society. Hence, in addition to the above, ask students to discuss the topics that they are working on more generally, and think carefully and deeply about what else could be going on.

Stereotypical Understanding

11 Jul

July 11, 2016November 17, 2017 editor

The paucity of women in Computer Science, Math and Engineering in the US is justly widely lamented. Sometimes, the imbalance is attributed to gender stereotypes. But only a small fraction of men study these fields. And in absolute terms, the proportion of women in these fields is not a great deal lower than the proportion of men. So in some ways, the assertion that these fields are stereotypically male is in itself a misunderstanding.

For greater clarity, a contrived example: Say that the population is split between two similar sized groups, A and B. Say only 1% of Group A members study X, while the proportion of Group B members studying X is 1.5%. This means that 60% of those to study X belong to Group B. Or in more dramatic terms: activity X is stereotypically Group B. However, 98.5% of Group B doesn’t study X. And that number is not a whole lot different from 99%, the percentage of Group A that doesn’t study X.

When people say activity X is stereotypically Group B, many interpret it as ‘activity X is quite popular among X.’ (That is one big stereotype about stereotypes.) That clearly isn’t so. In fact, the difference between the preferences for studying X between Group A and B — as inferred from choices (assuming same choices, utility) — is likely pretty small.

Obliviousness to the point is quite common. For instance, it is behind arguments linking terrorism to Muslims. And Muslims typically respond with a version of the argument laid out above—they note that an overwhelming majority of Muslims are peaceful.

One straightforward conclusion from this exercise is that we may be able to make headway in tackling disciplinary stereotypes by elucidating the point in terms of the difference between p(X|Group A) and p(X| Group B) rather than in terms of p(Group A | X).

The Value of Money: Learning from Experiments Offering Money for Correct Answers

10 Jul

July 10, 2016November 19, 2017 editor

Papers at hand:

Prior et al. and Bullock et al.

Two empirical points that we learn from the papers:

1. Partisan gaps are highly variable and the mean gap is reasonably small (without money, control condition). See also: Partisan Retrospection?
(The point is never explicitly commented on by either of the papers. The point has implications for proponents of partisan retrospection.)

2. When respondents are offered money for the correct answer, partisan gap reduces by about half on average.

Question in front of us: Interpretation of point 2.

Why are there partisan gaps on knowledge items?

1. Different Beliefs: People believe different things to be true: People learn different things. For instance, Republicans learn that Obama is a Muslim, and Democrats that he is an observant Christian. For a clear exposition on what I mean by belief, see Waters of Casablanca.

2. Systematic Lazy Guessing: The number one thing people lie about on knowledge items is that they have the remotest clue about the question being asked. And the reluctance to acknowledge ‘Don’t Know’ is in itself a serious point worthy of investigation and careful interpretation.

When people guess on items with partisan implications, some try to infer the answer using the cues in the question stem. For instance, a Republican, when asked whether unemployment rate under Obama increased or decreased, may reason that Obama is a socialist and since socialism is bad policy, it must have increased the unemployment rate.

3. Cheerleading: Even when people know that things that reflect badly on their party happened, they lie. (I will be surprised if this is common.)

The Quantity of Interest: Different Beliefs.
We do not want: Different Beliefs + Systematic Lazy Guessing

Why would money reduce partisan gaps?

1. Reducing Systematic Lazy Guessing: Bullock et al. use pay for DK, offering people small incentive (much smaller than pay for correct) to confess to ignorance. The estimate should be closer to the quantity of interest: ‘Different Beliefs.’

2. Considered Guessing: On being offered money for the correct answer, respondents replace ‘lazy’ (for a bounded rational human —optimal) partisan heuristic described above with more effortful guessing. Replacing Systematic Lazy Guessing with Considered Guessing is good to the extent that Considered Guessing is less partisan. If it is so, the estimate will be closer to the quantity of interest: ‘Different Beliefs.’ (Think of it as a version of correlated measurement error. And we are now replacing systematic measurement error with an error that is more evenly distributed, if not ‘randomly’ distributed.)

3. Looking up the Correct Answer: People look up answers to take the money on offer. Both papers go some ways to show that cheating isn’t behind the narrowing of the partisan gap. Bullock et al. use ‘placebo’ questions, and Prior et al. timing etc.

4. Reduces Cheerleading: For respondents for whom utility from lying < $, they stop lying. The estimate will be closer to the quantity of interest: 'Different Beliefs.'

5. Demand Effects: Respondents take the offer of money as a cue that their instinctive response isn’t correct. The estimate may be further away from the quantity of interest: ‘Different Beliefs.’

Sample This: Sampling Randomly from the Streets

23 Jun

June 23, 2016November 19, 2017 editor

Say you want to learn about the average number of potholes per unit paved street in a city. To estimate that quantity, the following sampling plan can be employed:

1. Get all the streets in a city from Google Maps or OSM
2. Starting from one end of the street, split each street into .5 km segments till you reach the end of the street. The last segment, or if the street is shorter than .5km, the only segment, can be shorter than .5 km.
3. Get the lat/long of start/end of the segments.
4. Create a database of all the segments: segment_id, street_name, start_lat, start_long, end_lat, end_long
5. Sample from rows of the database
6. Produce a CSV of the sampled segments (subset of step 4)
7. Plot the lat/long on Google Map — filling all the area within the segment.
8. Collect data on the highlighted segments.

For Python package that implements this, see https://github.com/soodoku/geo_sampling.

Ipso Facto: Analysis of Complaints to IPSO

11 Jun

June 11, 2016June 13, 2016 editor

Independent Press Standards Agency (IPSO) handles complaints about accuracy etc. in the media in the U.K. Against which media organization are most complaints filed? And against which organization are the complaints most often upheld? We answer these questions using data from the IPSO website. (The data and scripts behind the analysis are posted on GitHub.)

Between its formation in September, 2014 and May 20th, 2016, IPSO received 371 complaints. Expectedly, tabloid newspapers are well represented. Of the 371 complaints, The Telegraph alone received 35 complaints, or about 9.4% of the total complaints. It was followed closely by The Mail with 31 complaints. The Times had 25 complaints filed against it, The Mirror and The Express 22 each, and The Sun, 19 complaints.

Generally, less than half the number of complaints were completely or partly upheld. Topping the list was The Express and The Telegraph with 10 upheld complaints each. And following close behind was The Times with 8 complaints, The Mail with 6, and The Sun and the Daily Star with 4 each.

See also the plot of batting average of media organizations with most complaints against them.

Clustering and then Classifying: Improving Prediction for Crudely-labeled and Mislabeled Data

8 Jun

June 8, 2016August 27, 2022 editor

Mislabeled and crudely labeled data are common problems in data science. Supervised prediction of such data expectedly yields poor results—coefficients are biased, and accuracy with regards to the true label is poor. One solution to the problem is to hand code labels of an `adequate’ sample and infer true labels based on a model trained on that data.

Another solution relies on the intuition (assumption) that the distance between rows (covariates) of a label will be lower than the distance between rows of different labels. One way to leverage that intuition is to cluster the data within each label, infer true labels from erroneous labels, and then predict inferred true labels. For a class of problems, the method can be shown to always improve accuracy. (You can also predict just the cluster labels.)

Here’s one potential solution for a scenario where we have a binary dependent variable:

Assume a mis_labeled vector called mis_label that codes some true 0s as 1 and some true 1s as 0s.

For each mis_label (1 and 0), use k-means with k = 2 to get 2 clusters within each label for a total of 4 labels
Assuming mislabeling rate < 50%, create a new col. = est_true_label, which takes: 1 when mis_label = 1 and cluster label is of the majority class (that is cluster label class is more than 50% of the mis_label = 1), otherwise 0. 0 when mis_label = 0 and cluster label is of the majority class (that is cluster label class is more than 50% of the mis_label = 0), otherwise 1.
Predict est_true_label using logistic regression and produce accuracy estimates based on true_labels and bias estimates in coefficient estimates (compared to coefficients from logistic regression coefficients from true labels)

The Missing Plot

27 May

May 27, 2016August 10, 2016 editor

Datasets often contain missing values. And often enough—at least in social science data—values are missing systematically. So how do we visualize missing values? After all, they are missing.

Some analysts simply list-wise delete points with missing values. Others impute, replacing missing values with mean or median. Yet others use sophisticated methods to impute missing values. None of the methods, however, automatically acknowledge that any of the data are missing in the visualizations.

It is important to acknowledge missing data.

One can do it is by providing a tally of how much data are missing on each of the variables in a small table in the graph. Another, perhaps better, method is to plot the missing values as a function of a covariate. For bivariate graphs, the solution is pretty simple. Create a dummy vector that tallies missing values. And plot the dummy vector in addition to the data. For instance, see:

(The script to produce the graph can be downloaded from the following GitHub Gist.)

In cases, where missing values are imputed, the dummy vector can also be used to ‘color’ the points that were imputed.

Some Facts About PolitiFact

27 May

May 27, 2016August 27, 2022 editor

I assessed PolitiFact on:

Imbalance in scrutiny: Do they vet statements by Democrats or Democratic-leaning organizations more than statements by Republicans or Republican-leaning organizations?
Batting average by party: Roughly n_correct/n_checked, but instantiated here as mean Politifact rating.

To answer the questions, I scraped the data from PolitiFact and independently coded and appended data on the party of the person or organization covered. (Feel free to download the script for scraping and analyzing the data, scraped data, and data linking people and organizations to parties from the GitHub Repository.)

Until now, Politifact has checked the veracity of 3,859 statements by 703 politicians and organizations. Of these, I was able to establish the partisanship of 554 people and organizations. I restrict the analysis to 3,396 statements by organizations and people whose partisanship I could establish and who lean either towards the Republican or Democratic party. I code the Politifact 6-point True to Pants on Fire scale (true, mostly-true, half-true, barely-true, false, pants-fire) linearly so that it lies between 0 (pants-fire) and 1 (true).

Of the 3,396 statements, about 44% (n = 1506) of the statements checked by PolitiFact are by Democrats or Democratic-leaning organizations. The rest of the roughly 56% (n = 1890) are by Republicans or Republican-leaning organizations. The average PolitiFact rating of statements by Democrats or Democratic-leaning organizations (batting average) is .63; it is .49 for statements by Republicans or Republican-leaning organizations.

To check whether the results are driven by some people receiving a lot of scrutiny, I tallied the total number of statements investigated for each person. Unsurprisingly, there is a large skew, with a few prominent politicians receiving a bulk of the attention. For instance, PolitiFact investigated more than 500 claims by Barack Obama alone. The figure below plots the total number of statements investigated for thirty politicians receiving the most scrutiny.

If you take out Barack Obama, the percentage of Democrats receiving scrutiny reduces to 33.98%. More generally, limiting ourselves to the bottom 90% of the politicians in terms of scrutiny received, the share of Democrats is about 42.75%.

To analyze whether there is selection bias in covering politicians who say incorrect things more often, I estimated the correlation between the batting average and the total number of statements investigated. The correlation is very weak and does not appear to vary systematically by the party. Accounting for the skew by taking the log of the total statements or by estimating a rank-ordered correlation has little effect. The figure below plots the batting average as a function of the total number of statements investigated.

Caveats About Interpretation

To interpret the numbers, you need to make two assumptions:

The number of statements made by Republicans and Republican-leaning persons and organizations is the same as that made by people and organizations on the left.
Truthiness of statements by Republican and Republican-leaning persons and organizations is the same as that of left-leaning people and organizations.

About 85% Problematic: The Trouble With Human-In-The-Loop ML Systems

26 Apr

April 26, 2016November 19, 2017 editor

MIT researchers recently unveiled a system that combines machine learning with input from users to ‘predict 85% of the attacks.’ Each day, the system winnows down millions of rows to a few hundred atypical data points and passes these points on to ‘human experts’ who then label the few hundred data points. The system then uses the labels to refine the algorithm.

At the first blush, using data from users in such a way to refine the algorithm seems like the right thing to do, even the obvious thing to do. And there exist a variety of systems that do precisely this. In the context of cyber data (and a broad category of similar such data), however, it may not be the right thing to do. There are two big reasons for that. A low false positive rate can be much more easily achieved if we do not care about the false negative rate. And there are good reasons to worry a lot about false negative rates in cyber data. And second, and perhaps more importantly, incorporating user input on complex tasks (or where data is insufficiently rich) reduces to the following: given a complex task with inadequate time, the users use cheap heuristics to label the data, and supervised aspect of the algorithm reduces to learning cheap heuristics that humans use.

The Case for Ending Closed Academic Publishing

21 Mar

March 21, 2016November 11, 2017 editor

A few commercial publishers publish a large chunk of top flight of academic research. And earn a pretty penny doing so. The standard operating model of the publishers is as follows: pay the editorial board no more than $70-$100k, pay for typesetting and publishing, and in turn get copyrights to academic papers. And then go on and charge already locked in institutional customers—university and government libraries—and ordinary scholars extortionary rates. The model is gratuitously dysfunctional.

Assuming there are no long term contracts with the publishers, the system ought to be rapidly dismantled. But if dismantling is easy, creating something better may not be. It just happens to be. A majority of the cost of publishing is in printing on paper. Twenty first century has made printing large organized bundles on paper largely obsolete; those who need it can print on paper at home. Beyond that, open source software for administering a journal already exists. And the model of a single editor with veto powers seems anachronistic. Editing duties can be spread around much like peer review. As unpaid peer review can survive as it always has, though better mechanisms can be thought about. If some money is still needed for administration, it could be gotten easily by charging a nominal submission tax, waived where the author self identifies as being unable to pay.

Interpeting Clusters and ‘Outliers’ from Clustering Algorithms

19 Feb

February 19, 2016March 21, 2016 editor

Assume that the data from the dominant data generating process are structured so that they occupy a few small portions of a high-dimensional space. Say we use a hard partition clustering algorithm to learn the structure of the data. And say that it does—learn the structure. Anything that lies outside the few narrow pockets of high-dimensional space is an ‘outlier,’ improbable (even impossible) given the dominant data generating process. (These ‘outliers’ may be generated by a small malicious data generating processes.) Even points on the fringes of the narrow pockets are suspicious. If so, one reasonable measure of suspiciousness of a point is its distance from the centroid of the cluster to which it is assigned; the further the point from the centroid, the more suspicious it is. (Distance can be some multivariate distance metric, or proportion of points assigned to the cluster that are further away from the cluster centroid than the point whose score we are tallying.)

How can we interpret an outlier (score)? Tautological explanations—it is improbable given the dominant data generating process—aside.

Simply providing distance to the centroid doesn’t give enough context. And for obvious reasons, for high-dimensional vectors, providing distance on each feature isn’t reasonable either. A better approach involves some feature selection. This can be done in various ways, all of which take the same general form. Find distance to the centroid on features on which the points assigned to the cluster have the least variation. Or, on features that discriminate the cluster from other clusters the best. Or, on features that predict distance from the cluster centroid the best. Limit the features arbitrarily to a small set. On this limited feature set, calculate cluster means and standard deviations, and give standardized distance (for categorical variable, just provide ) to the centroid.

Sampling With Coprimes

1 Jan

January 1, 2016November 19, 2017 editor

Say you want to sample from a sequence of length n. Multiples of a number that is relatively prime to the length of the sequence (n) cover the entire sequence and have the property that the entire sequence is covered before any number is repeated. This is a known result from number theory. We could use the result to (sequentially) (see below for what I mean) sample from a series.

For instance, if the sequence is 1, 2, 3,…, 9, the number 5 is one such number (5 and 9 are coprime). Using multiples of 5, we get:

1	2	3	4	5	6	7	8	9
				X
X					X
	X					X
		X					X
			X					X

If the length of the sequence is odd, then we all know that 2 will do. But not all even numbers will do. For instance, for the same length of 9, if you were to choose 6, it would result in 6, 3, 9, and 6 again.

Some R code:


seq_length = 6
rel_prime  = 5
multiples  = rel_prime*(1:seq_length)
multiples  = ifelse(multiples > seq_length, multiples %% seq_length, multiples)
multiples  = ifelse(multiples ==0, seq_length, multiples)
length(unique(multiples))

Where can we use this? It makes passes over an address space less discoverable.

Clarifai(ng) the Future of Clarifai

31 Dec

December 31, 2015November 19, 2017 editor

Clarifai is a promising AI start-up. In a short(ish) time, it has made major progress on an important problem. And it is rapidly rolling out products with lots of business potential. But there are still some things that it could do.

As I understand it, the base version of Clarifai API is trying to do two things at once: a) learn various recognizable patterns in images b) rank the patterns based on ‘appropriateness’ and probab_true. I think Clarifai would have to split these two things over time and allow people to input what abstract dimensions are ‘appropriate’ for them. As the idiom goes, an image is a thousand words. In an image, there can be information about social class, race, and country, but also shapes, patterns, colors, perspective, depth, time of the day etc. And Clarifai should allow people to pick dimensions appropriate for the task. Though, defining dimensions would be hard. But that shouldn’t stymie the efforts. And ad hoc advances may be useful. For instance, one dimension could be abstract shapes and colors. Another could be the more ‘human’ dimension etc.

Extending the logic, Clarifai should support the building of abstract data science applications that solve a particular problem. For instance, say a user is only interested in learning about whether the photo features a man or a woman. And the user wants to build a Clarifai based classifier. (That person is me. The task is inferring gender of first names. See here.) Clarifai could in principle allow the user to train a classifier that uses all other information in the images, including jewelry, color, perspective, etc. and provide an out of sample error for that particular task. The crucial point is allowing users fuller access to what Clarifai can do and then letting the users manage it at their ends. To that end again, input about user objectives needs to be built into the API. Basic hooks could be developed for classification and clustering inputs.

More generally, Clarifai should eventually support more user inputs and a greater variety of outputs. Limiting the product to tagging is a mistake.

There are three other general directions for Clarifai to go into. A product that automatically sections an image into multiple images and tags each section would be useful. This would allow, for instance, to count the number of women in a photo. Another direction to go would be to provide the ‘best’ set of tags that collectively describe a set of images. (It may seem like violating the spirit of what I noted above but it needn’t — a user could want just this.) By the same token, Clarifai could build general-purpose discrimination engines — a list of tags that distinguishes image(s) the best.

Beyond this, the obvious. Clarifai can also provide synonyms of tags to make tags easier to use. And it could allow users to specify if they want, say tags in ‘UK English’ etc.