Exploratory data analysies

Exploratory data analysis (EDA) is valuable for gaining an understanding of your data before diving into predictive analytics. EDA is not technically a required step in a predictive analytics work flow. In some contexts, it may be possible to achieve excellent model performance with raw data measures and without understanding the underlying data very well. However, EDA is typically recommended and often, extremely valuable for…

Data Quality Assessment: EDA uncovers data quality issues like missing values, outliers, or implausible values, which might be the result of data entry errors. Addressing these quality concerns is imperative as they can drastically affect model accuracy. Plots of variables’ distributions and summary statistics (e.g., mean, median, standard deviation, range, quartiles, interquartile range) aid in this assessment.

Informational Value of Variables: Variables with little variation often add negligible value to predictive models. For instance, a binary predictor that’s consistent across 99% of observations offers limited insight. Similarly, a continuous variable clustering around a singular value isn’t very informative. Again plots and statistics that summarize variables’ distributions are helpful.

Identification of Potential Predictors:Correlations between independent variables and the outcome of interest can hint at important predictors. However, keep in mind that variables not strongly correlated with the outcome can still be valuable in models, especially when they interact with other variables or aid in capturing complex, non-linear relationships. It is difficult or impossible to visualize interaction terms (i.e. when predictors have a different relationship with the outcome when considered together than when considered apart).

Redundancy Detection: Tools like correlation matrices or heatmaps elucidate inter-variable correlations, highlighting redundancy. When faced with overlapping predictors, it’s beneficial to opt for the most explanatory variable, especially when aiming for model simplicity and interpretability.

Potential Bias: EDA is instrumental in detecting biases. Visual tools, such as histograms, display group distributions. An underrepresentation of a specific group, compared to the target population, can introduce model bias. Furthermore, a disproportionate frequency of missing values or outliers for certain groups is another bias indicator. Moreover, It’s crucial to evaluate the relationship between multiple variables and protected attributes (e.g., race, gender) or other important equity attributes (e.g., income level, geographical location). Variables strongly correlated with these sensitive attributes can inadvertently become proxies, potentially leading to biased decisions. For instance, using a neighborhood-based variable that correlates closely with race might unintentionally result in decisions influenced by racial attributes. EDA is an important first step for being attentive to bias, but assessing bias of model results is also essential. We will turn to this soon.