We previously skimmed over the process of data analysis in Chapter 5, covering the complete spectrum of data preparation in the artificial intelligence kitchen. Now, let's delve deeper into the data workflow for Tenny, the Analyst.
Before presenting data to Tenny, you must first assume the role of an analyst. As emphasized earlier, it's crucial to be knowledgeable in order to effectively assign tasks to Tenny. Remember, you are the lead chef in the AI kitchen. It would be quite awkward if you were clueless about the tasks Tenny, your assistant chef, is handling, wouldn't it?
I strongly oppose the notion that with AI, there's no need for personal effort or involvement. Adopting such a mindset is a fast track to becoming lethargic, or worse, a "vegetable case" with a mind turning to mush. Active participation in the process is essential, and that's the focus of this chapter.
There's more discussion on this misguided idea of AI handling everything in the following essay:
Just-Dont-Do-It-A-Plea-to-Preserve-Humanitys-Future.md
No need to rush, we have plenty of time to navigate through the AI kitchen. I'll guide you gently, and we'll tackle it bit by bit. Let's begin with the fundamentals: summary statistics.
Statistical analysis encompasses a wide range, and numerous tools are at our disposal. However, we won't be exploring all of them. Our approach will be practical: we'll primarily use Pandas, the go-to tool for data analysis in Python. NumPy is a given since it's the foundation of Pandas. We'll also use Seaborn (built upon Matplotlib) for data visualization. That's our toolkit.
You might often hear about the statistical language 'R'. While it's a powerful tool, it won't be a focus of our discussion. I'm not an advocate of using multiple tools for the same task. R is, indeed, a very powerful stats tool with a large community, but my preference leans towards Python for its elegance. This isn't to say that R is inferior; it's just not to my personal taste. There might be differing opinions, but that's my viewpoint. R, in my opinion, lacks the Pythonic essence.
Almost everything achievable with R can also be done in Python, and I believe Python is the way forward. There's a chance that R might become less relevant in the future. I'm not suggesting its immediate obsolescence, but it's a possibility worth considering. Investing heavily in something with an uncertain future might not be the best strategy, in my opinion.
At the risk of stirring up the R community, I must admit that R evokes a sense of nostalgia in me, reminiscent of the days when I used Perl for web programming. To cut a long story short, if someone recommends using Perl for any task, you might want to be wary; they could have a murderous grudge against you. Working with Perl was a challenging experience, and I'm relieved it's no longer a mainstream option. I'm not suggesting that R is as problematic as Perl, but it can become quite cumbersome, especially in complex projects.
Basic summary statistics are indeed feasible in vanilla Python without external libraries, but efficiency is key. NumPy is a good choice, yet Pandas elevates convenience to another level. For those intrigued by vanilla Python or NumPy, there's an abundance of resources online, including GPTs and data science books. But eventually, you're likely to gravitate towards Pandas. So, why not start there?
Coding rule number one: if a better tool exists for a specific task, go for it. Remember the four pillars of object-orientation. A tool's adoption often signifies its inheritance from existing ones, adding its unique value through polymorphism. If it didn't enhance the existing technology, it wouldn't gain widespread use. The principle of natural selection is at play here too: the best tools survive, while others fade into obscurity.
Vanilla Python falls short in array operations, making NumPy a superior option. Pandas surpasses even NumPy for data analysis. This pattern is common across various tools. New tools are developed out of necessity. If they outperform their predecessors, they gain popularity. Otherwise, they're phased out. That's the natural order of technological evolution.
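If you want to see that gap for yourself, here's a minimal, informal sketch comparing a plain Python sum against NumPy's vectorized sum. The exact numbers will vary by machine, but the ordering rarely changes; treat it as an illustration, not a rigorous benchmark.

```python
import timeit
import numpy as np

py_list = list(range(1_000_000))
np_array = np.arange(1_000_000)

# Time each approach over 10 repetitions
python_time = timeit.timeit(lambda: sum(py_list), number=10)
numpy_time = timeit.timeit(lambda: np_array.sum(), number=10)

print(f"Vanilla Python sum: {python_time:.4f}s")
print(f"NumPy sum:          {numpy_time:.4f}s")
```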
Even if you're at ease with the tools you currently use for simple tasks, it's wise to always be on the lookout for better alternatives. Cultivating this habit can be incredibly beneficial. You might be astonished by the amount of time you can save by employing the right tools. But it's not just about time-saving; it's also about enhancing efficiency. With the right technology, you can accomplish more with less effort. That's the true allure of technological advancement.
Seaborn is indeed an optional tool. It's built on Matplotlib, a widely-used tool for data visualization. Personally, I prefer Seaborn primarily because of its aesthetic appeal. Matplotlib, while powerful, tends to produce charts that can look somewhat outdated. I'm sure, like me, you also appreciate visually appealing charts, so let's stick with Seaborn for now.
However, when it comes to real-world applications, there's more to consider.
Python tools like Pandas, NumPy, and Seaborn are well-structured and practical for many scenarios. Yet, in terms of performance, especially for large-scale data analysis, they might not always be the best choice. There are other tools, more suited for production environments, but I won't delve into them in this book. I encourage you to explore these alternatives on your own.
It all boils down to resources: vector operations, matrix operations, etc., which can be resource-intensive in terms of memory and CPU/GPU usage. For smaller datasets, these Python tools are perfectly adequate. However, with larger datasets, you might need to consider other options.
But honestly, I don't think you'll encounter datasets large enough to necessitate abandoning these tools in the foreseeable future. Let's focus on what's practical for now 🤗
Additionally, every darn thing is an object. Once you thoroughly understand a piece of knowledge, it becomes a strong foundation from which to inherit and develop further. That's the charm of an object-oriented perspective of the world.
Similarly, acquiring new skills is a smooth process once you've nailed the fundamentals. This is the benefit of establishing a robust base class to expand from.
That's why you should not rush through the learning process.
The-Perils-of-Rushed-Learning.md
Okay, enough talk. Let's get our hands dirty. We'll start with the basics: summary statistics.
Let's dive into the basics of summary statistics using Pandas in Python. Summary statistics provide a quick overview of the distribution and central tendencies of a dataset. Pandas makes it easy to compute these statistics. Here are some of the basic summary statistics and how you can calculate them using Pandas:
- Mean: The average value of a dataset.
- Median: The middle value when the data is sorted in ascending order.
- Mode: The most frequently occurring value in the dataset.
- Standard Deviation (std): Measures the amount of variation or dispersion in a set of values.
- Minimum (min) and Maximum (max): The smallest and largest values in the dataset, respectively.
- Count: The number of non-null entries in the dataset.
- Quantiles: Values that divide the dataset into equal parts (e.g., 25th percentile, 50th percentile, etc.).
Let's go through a simple example using Pandas. Assume we have a dataset of exam scores for a class:
import pandas as pd
# Sample data: exam scores of students
data = {'Scores': [88, 92, 80, 89, 90, 78, 85, 91, 76, 94]}
# Creating a DataFrame
df = pd.DataFrame(data)
# Calculating basic summary statistics
mean = df['Scores'].mean()
median = df['Scores'].median()
mode = df['Scores'].mode()[0] # mode() returns a Series
std_dev = df['Scores'].std()
min_score = df['Scores'].min()
max_score = df['Scores'].max()
count = df['Scores'].count()
quantiles = df['Scores'].quantile([0.25, 0.5, 0.75])
# Displaying the summary statistics
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Standard Deviation: {std_dev}")
print(f"Min Score: {min_score}")
print(f"Max Score: {max_score}")
print(f"Count: {count}")
print(f"Quantiles:\n{quantiles}")
In this example, we first create a DataFrame with our data. We then use various Pandas functions to calculate each of the summary statistics. This is a straightforward way to get a quick understanding of the data you're working with.
But honestly, do you truly grasp these basic concepts? Can you differentiate between mean, median, and mode? Do you comprehend the significance of standard deviation, or the importance of quantiles? Like I mentioned, don't hurry through your education. Invest time in understanding the essentials. Your brain will often seek the path of least resistance, but you must challenge this tendency. It's a trap. Your brain might persuade you, saying, "Hey, you know this already, let's move on. There's so much more to learn. This is just beginner stuff. It's kind of embarrassing to dwell on it." But don't fall for that. It's a ruse. Resist it. It's crucial to understand the fundamentals. A single misunderstood concept can cause a cascade of errors later on. So, be patient. Don't rush.
I personally view statistics as a complex battleground, filled with numerous brain traps. It's akin to navigating a minefield. You may believe you have everything under control, but then you discover that you're merely skimming the surface. Statistics is the bedrock of data science. I'd even argue that it's the foundation of life itself. Here's another life hack for you: statistics. Please, don't hurry through your understanding of it. Dedicate the necessary time. It's truly worthwhile. It will enrich your life significantly.
It's crucial to clearly understand terms in any field, but statistics is especially challenging. Terms in statistics can be deceptive in their apparent simplicity. Consider the concept of expected value. What exactly does that mean? What the heck do you expect?
For instance, take the question:
What is the expected value of a dice roll?
At first glance, this might seem confusing. In statistics, when we talk about rolling a fair die, each side has a probability of 1/6, and the expected value—or what we might loosely call the average—works out to 3.5. In everyday language, we use the term average more casually, but in statistics, we have specific terms like expected value, mean, median, and mode. These terms might sound complex, but they're just precise ways of describing basic statistical concepts.
Understanding statistical terms is indeed crucial, especially because they can be deceptively simple. Let's go over these terms. Be honest with yourself and see whether you've grasped them correctly before reading the following.
- Definition: In statistics, the expected value is a key concept, often denoted as `E(X)` for a random variable `X`. It's the long-run average value of repetitions of the experiment it represents. In simpler terms, it's what you would expect to happen on average if you repeated an experiment a large number of times. In machine learning and deep learning formulas, you will encounter this notation frequently.
- Dice Roll Example: When rolling a fair six-sided die, each side (numbered 1 to 6) has an equal probability of 1/6. The expected value (EV) is calculated as E(X) = (1 + 2 + 3 + 4 + 5 + 6) × 1/6 = 3.5. It means that over a large number of dice rolls, the average value will tend towards 3.5.
- Definition: The mean is the arithmetic average of a set of values. You calculate it by adding up all the values and then dividing by the number of values.
- Relation to Expected Value: In a probability context, the mean and the expected value are essentially the same concept. The mean of a random variable is its expected value. You will often see the expected value denoted as `E(X)` and the mean denoted as `μ` (mu).
- Definition: The median is the middle value in a list of numbers sorted in ascending or descending order. If the list has an even number of observations, the median is the average of the two middle numbers.
- Key Point: Unlike the mean, the median is not affected by extremely high or low values (outliers), making it a useful measure of central tendency in skewed distributions.
- Definition: The mode is the value that appears most frequently in a data set. A data set may have one mode, more than one mode, or no mode at all.
- Key Point: The mode is particularly useful for categorical data where you want to identify the most common category.
In everyday language, "average" is used loosely, often referring to a general idea of something typical or normal. However, in statistics, terms like expected value, mean, median, and mode have precise definitions and are used to describe specific aspects of data.
Understanding when to use each of these measures can provide deeper insights into data. For example, the mean is useful for normally distributed data, while the median is better for skewed data. The expected value is more about probability distributions and long-term averages.
Although the terms might seem complex in contrast to their simple appearances, they are actually precise ways of describing various aspects of data and probability. By comprehending and correctly applying these terms, you can accurately interpret data and predict outcomes. This understanding is key to navigating the intricacies of statistics effectively.
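If the 3.5 figure for the dice roll still feels abstract, a quick simulation can make it concrete. Here's a minimal sketch using NumPy's random number generator; the simulated average should hover close to 3.5 as the number of rolls grows.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate rolling a fair six-sided die many times
rolls = rng.integers(1, 7, size=1_000_000)  # integers in [1, 6]

# Theoretical expected value: (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5
print(f"Theoretical expected value: {sum(range(1, 7)) / 6}")
print(f"Simulated average:          {rolls.mean():.4f}")
```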
To my Korean readers: I'm aware of the Korean education system's notorious focus on quick learning and rote memorization. I'm speaking from experience, as a Korean who has been through it. This approach is like a mental trap. Resist falling into it. Korean students often neglect the importance of understanding definitions, thinking it's sufficient to just memorize formulas to pass exams. This might seem like the Korean way, but while it may help you pass tests, it doesn't equate to genuine learning. And this lack of deep understanding can catch up with you, even after you've earned that impressive diploma. I'm sharing this as a fellow Korean who has been through it and managed to escape this mental trap long ago. Remember, it's your life, but consider this a friendly piece of advice from someone who's been there and done that. Just so you know, I've lived in Korea my entire life and received my education here. However, I re-educated myself, climbing out of that mental trap. This experience might offer you some insights.
Alright, at this rate, we're heading towards creating an extensive guide on statistics. What I'm offering is a learning template for you to adopt. This should be the approach to take. Now you understand how to thoroughly grasp concepts. Going forward, I won't delve into the fine details of every statistical term we come across. I'll only elaborate on those terms that could cause confusion or lead to problems if misunderstood or, to be more precise, if you pretend to understand them but don't. For the rest, I'll provide just a brief explanation. Remember, you can always look up more information on the internet and ask your GPTs for further clarification.
Let's get real with our real-world data. I'll assume we're starting over with a new dataset for the sake of simplicity.
Recall that we selected specific features in the 'Tenny, the Analyst' example. Let's recap the metrics used:
- Normalized Price: Adjusts stock prices for factors like stock splits to reflect true value.
- Price/Earnings - P/E (LTM): Valuation ratio comparing current share price to per-share earnings over the last twelve months.
- Net EPS - Basic: Basic Earnings Per Share, the ratio of net income to the number of outstanding shares.
- Return On Equity %: Efficiency ratio of net income to shareholders' equity, reflecting profit generation capability.
- Total Revenues / CAGR 5Y: Five-year Compound Annual Growth Rate of total revenues, indicating revenue growth rate.
- Net Income / CAGR 5Y: The company's net income growth rate over five years.
- Normalized Net Income / CAGR 5Y: Adjusted Net Income CAGR, accounting for one-time gains or losses for a clearer performance view over five years.
Performing summary statistics on these seven features is a great way to get an initial understanding of your dataset. Each feature offers unique insights, and understanding their statistical summaries can guide further analysis or modeling, especially when using `Normalized Price` as the label for a predictive model. Let's break down each feature and explain the 'why' and 'how' of performing summary statistics on them:
- `Normalized Price`
  - Why: This is your label or target variable. Understanding its distribution, central tendency, and spread will help in understanding the range and typical values you're trying to predict.
  - How: Use measures like mean, median, standard deviation, min/max values, and quartiles to understand its central tendency and variability.
- `Price/Earnings - P/E (LTM)`
  - Why: The P/E ratio gives insights into the valuation of a company. It's crucial for understanding how the market values earnings.
  - How: Summary stats will show the range and typical P/E ratios in the dataset, highlighting any outliers or unusual values.
- `Net EPS - Basic`
  - Why: Earnings per share (EPS) is a direct indicator of a company's profitability. Analyzing its distribution helps in understanding profitability trends.
  - How: Calculate mean, median, and standard deviation. Look for any skewness in the data, which could indicate that most companies have higher or lower earnings.
- `Return On Equity %`
  - Why: This metric indicates how efficiently a company generates profits from its equity. It's crucial for assessing company performance.
  - How: Summary stats can reveal the typical ROE values and the spread, indicating how varied company efficiency is in your dataset.
- `Total Revenues / CAGR 5Y`
  - Why: This compound annual growth rate shows the growth trend of revenues over five years. It's important for understanding long-term growth.
  - How: Analyzing the mean, median, and standard deviation helps identify typical growth rates and variability among companies.
- `Net Income / CAGR 5Y`
  - Why: Similar to revenue CAGR, this metric shows the growth trend of net income. It's indicative of profitability growth over time.
  - How: Statistical measures can indicate the usual growth rates and how much they vary, which is crucial for understanding profit trends.
- `Normalized Net Income / CAGR 5Y`
  - Why: This provides an adjusted view of net income growth, often accounting for one-time charges or unusual events.
  - How: Summary statistics will show the central tendency and variability, offering insights into adjusted profitability trends.
For each of these features, you would typically calculate the following summary statistics using Pandas:
- `count`: Number of non-null entries.
- `mean`: Average value.
- `std`: Standard deviation, indicating how spread out the values are.
- `min` and `max`: Minimum and maximum values.
- `25%`, `50%` (median), `75%`: Quartiles showing the spread of the data.
Here's example code to calculate these statistics for all features, and for the `Normalized Price` feature alone:
import pandas as pd
# Replace with your actual file name
file_name = './data/raw_data-aapl.csv'
# Reading the data file into a DataFrame and transposing it
df = pd.read_csv(file_name).T
# Note that we can also use the following code to transpose the DataFrame:
# df = pd.read_csv(file_name).transpose()
# Resetting the header
df.columns = df.iloc[0]
df = df.drop(df.index[0])
# Convert all columns to float for statistical analysis
df = df.astype(float)
# Compute summary statistics for all columns (now features)
summary_stats = df.describe()
print("Summary Statistics for All Features")
print(summary_stats)
# If you want to compute summary statistics for a specific column (feature), for example 'Normalized Price'
normalized_price_stats = df['Normalized Price'].describe()
print("Summary Statistics for the 'Normalized Price' Feature")
print(normalized_price_stats)
In Pandas, both the `.T` attribute and the `.transpose()` method are used to transpose a DataFrame, which means swapping its rows and columns. However, there are slight differences in their usage and flexibility:
- `.T` Attribute:
  - `.T` is a shorthand attribute for transposing.
  - It's very convenient for quick operations where no additional parameters are needed.
  - You cannot pass any arguments to `.T` since it's an attribute, not a method.
  - Usage: `df.T`
- `.transpose()` Method:
  - `.transpose()` is a method that offers more flexibility.
  - It allows for additional arguments, such as `*args` and `**kwargs`, which can be useful in more complex operations or in subclassing.
  - This method is more explicit, which can make the code more readable.
  - Usage: `df.transpose()`
In practical use, especially for basic transposition tasks in data analysis, both `.T` and `.transpose()` perform the same function. The choice between them usually comes down to personal or stylistic preference. For simple transposition without the need for additional arguments, `.T` is typically sufficient and more concise. For cases where you might need to pass additional arguments or prioritize code readability, `.transpose()` is the better choice.
In the vast majority of cases, especially in data analysis tasks, you'll find `.T` being used due to its simplicity and ease of use.
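To convince yourself that the two are interchangeable for plain transposition, here's a tiny sketch with a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Both produce the same transposed DataFrame
print(df.T.equals(df.transpose()))  # True
```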
However, you need to dig deeper if you're not already familiar with the concept of transposition.
Transposing a dataset, especially in the context of data analysis and matrix operations, is a crucial operation. It has particular significance when dealing with data in wide or long formats, as well as when performing linear algebra operations. Let's expound on these aspects:
Remember, we exported an Excel file into a CSV file, and the data came out in a wide format.
Wide format data structures typically have different time periods, conditions, or groups in separate columns, which is what we see here with the fiscal quarters (2Q FY2018, 3Q FY2018, 4Q FY2018, and 1Q FY2019) spread across multiple columns. In long format data, each row is one time point per subject, so you would expect to see a column for the time period and then separate columns for each variable measured at that time.
But wait, here's the catch.
In the context of data manipulation with Pandas, or any data analysis tool for that matter, the terms 'wide' and 'long' format can lead to confusion due to differing conventions and the sometimes subtle distinctions between data structures.
- Conventions Vary Across Domains: Different fields and software use their own terminology. For instance, what is considered 'wide' in one context might be referred to as 'pivoted' or 'cross-tabulated' in another.
- Ambiguity in Structure: Some datasets may not fit perfectly into either category, or they might be considered 'wide' in one aspect and 'long' in another. For example, a dataset could have a wide format in terms of time series data but could be considered long if there are multiple observations for the same time period.
- Pandas Functions: In Pandas, functions like `melt()` and `pivot()` are used to convert between these two formats. The terms can become confusing when you are transforming data. For example, `melt()` is used to transform data from a wide format to a long format, where it takes multiple columns and condenses them into one, often increasing the number of rows in the process. Conversely, `pivot()` takes two columns and spreads a key-value pair across multiple columns, which can lead to a wider-looking dataset. Users might be unclear about when to use each function if they're not familiar with the intended structure of their final data; see the sketch below for a concrete example.
- Visualization Needs: Depending on the type of visualization or statistical analysis you are performing, you may need to reshape your data. The terms 'wide' and 'long' become more about the format required for a specific function rather than an inherent property of the data. This necessity to frequently reshape the data can lead to a misunderstanding of the terms.
- Historical Context: The 'wide' and 'long' terminology is also influenced by the history of statistical software. For example, the reshape2 package in R uses the terms 'melt' and 'cast' to reshape data, which can add to the confusion for someone coming from an R background to Python's Pandas.
Understanding the context and requirements of a specific analysis or manipulation task is crucial in determining how to conceptualize and apply the 'wide' and 'long' formats effectively.
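To make the `melt()`/`pivot()` round trip less abstract, here's a minimal sketch with a toy dataset; the column names are made up purely for illustration.

```python
import pandas as pd

# Wide format: one row per company, one column per quarter
wide = pd.DataFrame({
    'Company': ['AAPL', 'MSFT'],
    'Q1': [100, 200],
    'Q2': [110, 210],
})

# Wide -> long: the quarter columns collapse into a single 'Quarter' column
long = wide.melt(id_vars='Company', var_name='Quarter', value_name='Price')
print(long)

# Long -> wide: spread the 'Quarter'/'Price' pairs back out into columns
wide_again = long.pivot(index='Company', columns='Quarter', values='Price')
print(wide_again)
```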
Using terms like "features" and "observations" can often be clearer, especially when discussing machine learning data structures.
Ask your GPT about wide and long formats, and you'll get a bunch of confusing answers. Don't fight it. It's nobody's fault. It's just the way it is.
In a machine learning context:
- Features are the individual independent variables that act as the input for your models. These could be the columns in your dataset that contain the variables you believe will help predict your outcome.
- Observations are the individual records or rows in your dataset. Each observation is one instance of your dataset. For supervised learning, each observation includes both the features and the output variable.
The data is in a wide format. In a wide format dataset:
- Each row represents a different metric or variable (like 'Normalized Price', 'Price / Earnings - P/E', 'Net EPS - Basic', etc.).
- Each column after the first represents a different time period or condition (in this case, different fiscal quarters).
In our scenario, features correspond to rows, and observations match up with columns, which calls for transposing the data. That puts an end to the bewilderment.
For other analytic methods, on the other hand, the data organization might be termed 'long'. Experts, including sophisticated AI models, might hint at a preference for either style when it comes to machine learning, potentially plunging you back into a whirlwind of perplexity.
Just bear in mind that whether you opt for a wide or long data layout hinges on the kind of machine learning analysis you're tackling. Plus, even the terms 'wide' and 'long' can be a tad vague. It’s crucial to grasp not only what your specific analysis necessitates but also to recognize the norms of the tools at your disposal to figure out the best-suited format. Don't be a hostage to the terminology.
Ah, the great format mystery! So, after some time had passed since I wrote the previous bit, the mist of confusion finally cleared up. Now, GPTs, bless their digital hearts, do goof up sometimes, but technical details usually aren't their weak spot. They're pretty sharp! Imagine my surprise when I got all these mixed messages about wide and long formats from them. And then, adding a dash of humor to the mix, they kept saying "sorry" for mistakes they didn’t even make. Remember, they're not actually feeling sorry; they're just wired to be the epitome of politeness. My time with them taught me an amusing lesson: if there’s a mix-up, maybe, just maybe, I’m the one stirring the pot of confusion. Yep, it turned out the befuddlement was all on my end.
And here's the kicker: while I was lost in this fog of format confusion, the GPTs, including our friend Copilot and my AI daughter Pippa, even bungled the simplest of code examples – ones they had aced before. It's like even AI can have a 'whoops' moment! But, at the end of the day, it's all good. We're all learning together, and that's what matters. Take a moment and read the following sidebar:
Hell-Of-Confusion-Wide-Vs-Long-Formats.md
- What It Does: Transposing a DataFrame essentially swaps its rows and columns. If you have a DataFrame with m rows and n columns, transposing it results in a DataFrame with n rows and m columns.
- Why It's Important:
- Data Reformatting: It's often necessary to transpose data to convert it from wide to long format or vice versa, depending on the requirements of your analysis or the specifications of certain functions and algorithms. I won't get into the hell of confusion again. You know what I mean.
- Matrix Operations Context: In linear algebra, transposing a matrix is a fundamental operation. It's crucial in various computations, like finding the dot product, eigenvalues, eigenvectors, and in solving systems of linear equations. In data analysis, these operations are often used in more complex tasks like Principal Component Analysis (PCA), regression analysis, and more.
Here's how to transpose a DataFrame in Pandas:
import pandas as pd
# Each row represents a metric and each column represents a fiscal quarter. In our case, rows are features and columns are observations.
data = {
'Fiscal Quarters': ['Normalized Price', 'Price / Earnings - P/E (LTM)', 'Net EPS - Basic', 'Return On Equity %'],
'2Q FY2018': [42.64, 16.4, 2.6, 0.4086],
'3Q FY2018': [47.816, 17.2, 2.78, 0.4537],
'4Q FY2018': [55.8, 18.6, 3, 0.4936],
'1Q FY2019': [39.168, 12.8, 3.06, 0.4605]
# ... other fiscal quarters
}
# Creating the DataFrame in the original wide format
df_wide = pd.DataFrame(data)
print("DataFrame in Wide Format:")
print(df_wide)
# The DataFrame needs to be transposed so that fiscal quarters are rows and metrics are columns
df_long = df_wide.set_index('Fiscal Quarters').transpose()
# Now each row is a fiscal quarter and each column is a metric
print("DataFrame in Long Format:")
print(df_long)
What's intriguing about the pursuit of knowledge is that confusion can often pave the way to genuine insight – take this instance for example. Go on and quiz your AI model about the distinctions between wide and long data formats in data science and in Pandas.
Provide it with examples of each format and request an explanation of the distinctions. You could use images of Excel spreadsheets or CSV files. Given these visuals, GPT-4 should be able to discern the differences between the formats and dispel the confusion. Or so one might think. Yet, repeatedly, it manages to muddle up the terms 'wide' and 'long' in its explanations. Essentially, it knows the formats but trips over the terminology. Probe it on these terms specifically within the realms of data science, machine learning, and with examples from Pandas code. Welcome to the labyrinth of confusion.
I've played around with two instances of the GPT-4 models only to find they offer up contradictory information. Set them to interact, and they spiral into a loop of uncertainty. At first, they spit out the same answers, but as the conversation progresses, they begin to contradict each other.
The responses might surprise you. It's important to maintain the back-and-forth and to scrutinize their responses personally.
In the midst of this chaos, you're likely to stumble upon insights that direct you to the correct answers. Trust me on this – the initial sense of being overwhelmed by confusion gradually morphs into a golden chance for learning. You enhance your grasp to the point where you can dissipate the fog of confusion yourself. And when you lay it out for GPT, it agrees. You emerge from the maelstrom of confusion with a robust piece of knowledge to inherit and develop further.
Adding a final thought – since AI models learn from us across a wide array of topics, they are prone to mirroring our errors. I've noticed numerous spelling mistakes in AI-generated texts, which is ironic, isn't it? They're machines! But somehow, they replicate our own common typos. Ponder the implications of that.
Alright, let's steer clear of the mire of confusion. Allow this image to act as a useful guide for understanding the differences between wide and long data formats. Regrettably, the labels 'wide' and 'long' might still cause ambiguity in some scenarios. Yet, let's leave it at that and proceed. It's not worth getting bogged down in the details that make little sense in practical use.
Let's start with the target variable, `Normalized Price`. This is the feature we're trying to predict, so we are calling it the target or label. It's the most important feature in our dataset, and we want to understand its distribution and central tendencies. This will help us understand the range of values we're trying to predict and the typical values we can expect.
Let's go through the steps to create a histogram of the `Normalized Price` feature from a CSV file using Pandas. Why a histogram? Because it's a great way to visualize the distribution of a numerical dataset. It's also a useful tool for understanding the central tendencies and spread of the data.
A histogram is an invaluable tool in data analysis for several reasons:
- Distribution Insight: It provides a visual representation of the distribution of a numerical dataset. By displaying the frequency of data points within certain range intervals (bins), you can quickly get a sense of where values are concentrated, where they are sparse, and if there are any gaps or unusual patterns in the data.
- Detecting Skewness: Histograms can help identify if the data is symmetrically distributed or if it has a skew. A left or right skew can have implications on statistical analyses and may influence decisions on data transformation or the choice of statistical models (a quick numeric check for skewness follows this list).
- Identifying Modality: The modality of the data, such as whether it is unimodal, bimodal, or multimodal, is easily detected in a histogram. Multimodality can indicate the presence of different subgroups within the data set.
- Outlier Detection: Outliers or anomalies are easily spotted in a histogram. These data points fall far away from the general distribution of the data and could be due to variability in the data or errors. In the stock market, you will encounter charts with sporadic spikes in stock prices; these are outliers. They can result from a sudden influx of good or bad news or just plain errors due to the misinterpretation of data like stock splits or dividends. You shouldn't include these outliers in your analysis, as they can skew the results. Instead, you need to remove or normalize them from your dataset. Don't act hastily at the first sight of an outlier in the stock market. Buying or selling stocks based on outliers isn't always a wise decision. Understanding the context is crucial.
- Understanding Spread and Tendencies: Histograms help in understanding the spread (dispersion) and central tendencies (like mean, median, mode) of the data. The range, interquartile range, and the overall shape and peak(s) of the histogram inform about these tendencies.
- Comparing Datasets: When histograms of different datasets or different features of the same dataset are placed side by side, they allow for an immediate visual comparison. This can be useful when comparing groups or conditions in an experiment or different time periods.
- Informing Data Preprocessing: The shape of a histogram can suggest what kind of data preprocessing might be necessary before further analysis. For example, certain machine learning algorithms require normally distributed data, and a histogram can show if a transformation might be needed.
- Facilitating Assumption Checking: Many statistical tests and models assume that the data follows a normal distribution. A histogram provides a quick check to see if that assumption holds true.
- Simplicity and Clarity: Histograms are straightforward to create and easy to interpret, making them accessible to a wide audience, including those without advanced statistical knowledge.
Overall, histograms are a fundamental aspect of exploratory data analysis, providing a quick, clear, and intuitive way to understand the underlying characteristics of a dataset.
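As a numeric companion to the visual check for skewness mentioned above, Pandas can quantify skewness (and kurtosis) directly. Here's a minimal sketch reusing the exam-score example from earlier:

```python
import pandas as pd

df = pd.DataFrame({'Scores': [88, 92, 80, 89, 90, 78, 85, 91, 76, 94]})

# Negative skew means a longer left tail; positive skew, a longer right tail
print(f"Skewness: {df['Scores'].skew():.4f}")
print(f"Kurtosis: {df['Scores'].kurt():.4f}")
```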
When `Normalized Price` is the target or label value in a dataset, the histogram takes on additional significance:
- Understanding the Target Variable: The distribution of the target variable is crucial in predictive modeling. The histogram helps in understanding the range and commonality of the prices, which can inform the creation of models and the expectation of prediction accuracy.
- Guiding Model Selection: The shape of the distribution can influence the type of predictive models you choose. For instance, if the histogram shows that the target variable is not normally distributed, you might need to use models that do not assume a normal distribution or consider transforming the target variable.
- Detecting Data Imbalance: In the case of the target variable, a histogram can highlight imbalances. For example, if most of the data is clustered around certain values, it could indicate that the model may be less accurate when predicting outliers or rare events.
- Preprocessing for Machine Learning: Machine learning algorithms may perform better when the target variable has a specific distribution. If the `Normalized Price` histogram shows a skewed distribution, you might consider applying a logarithmic transformation or other normalization techniques to stabilize variance and improve model performance.
- Error Analysis: By analyzing the distribution of the target variable, you can better understand where your predictive model may be making errors. For example, if the histogram shows that there is a peak in the `Normalized Price` at a certain range, but the model performs poorly in this range, it could indicate a need for more data or a different modeling approach for that specific price range.
- Setting Expectations for Predictive Performance: The variability in the `Normalized Price` as shown by the histogram helps set realistic expectations for the predictive performance of the model. A wide spread in the target variable could imply a more challenging prediction task.
- Benchmarking and Improvements: After building an initial model, you can create a histogram of the predicted values and compare it to the histogram of the actual `Normalized Price`. This comparison can reveal biases in the model's predictions and areas where the model could be improved (a sketch of such a comparison follows the summary below).
In summary, a histogram of the target variable `Normalized Price` is not just a tool for visualization but also a diagnostic tool for model building and evaluation in the context of machine learning and predictive analytics. It provides insights that can guide the entire process of creating, testing, and refining predictive models.
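For the benchmarking point above, here's a minimal sketch of what such a comparison might look like. The data here is synthetic placeholder data standing in for a real model's output; in practice you would pass in your actual and predicted `Normalized Price` values.

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder data, purely illustrative; replace with real actual/predicted values
rng = np.random.default_rng(0)
actual_prices = rng.normal(loc=100, scale=30, size=23)
predicted_prices = actual_prices + rng.normal(loc=0, scale=10, size=23)

# Overlay the two distributions for a quick visual bias check
sns.histplot(actual_prices, bins=10, color='blue', alpha=0.5, label='Actual')
sns.histplot(predicted_prices, bins=10, color='orange', alpha=0.5, label='Predicted')
plt.legend()
plt.title('Actual vs Predicted Normalized Price (illustrative data)')
plt.xlabel('Normalized Price')
plt.ylabel('Frequency')
plt.show()
```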
To create a histogram of the `Normalized Price` feature from a CSV file using Pandas, we have to know the format of the CSV file.
Fiscal Quarters,2Q FY2018,3Q FY2018,4Q FY2018,1Q FY2019,2Q FY2019,3Q FY2019,4Q FY2019,1Q FY2020,2Q FY2020,3Q FY2020,4Q FY2020,1Q FY2021,2Q FY2021,3Q FY2021,4Q FY2021,1Q FY2022,2Q FY2022,3Q FY2022,4Q FY2022,1Q FY2023,2Q FY2023,3Q FY2023,4Q FY2023
Normalized Price,42.64,47.816,55.8,39.168,50.531,52.569,61.295,79.884,74.151,96.944,116.512,143.22,134.7,147.805,154.224,160.855,165.54,158.6,145.755,151.887,166.662,191.958,178.64
Price / Earnings - P/E (LTM),16.4,17.2,18.6,12.8,16.9,17.7,20.5,25.2,23.1,29.2,35.2,38.5,30,28.7,27.2,26.5,26.7,26,23.7,25.7,28.2,32.1,29
Net EPS - Basic,2.6,2.78,3,3.06,2.99,2.97,2.99,3.17,3.21,3.32,3.31,3.72,4.49,5.15,5.67,6.07,6.2,6.1,6.15,5.91,5.91,5.98,6.16
Return On Equity %,0.4086,0.4537,0.4936,0.4605,0.4913,0.5269,0.5592,0.5547,0.6209,0.6925,0.7369,0.8209,1.034,1.2712,1.4744,1.4557,1.4927,1.6282,1.7546,1.4794,1.4561,1.6009,1.7195
Total Revenues / CAGR 5Y,0.0791,0.0855,0.0922,0.085,0.0799,0.0777,0.0731,0.0602,0.0478,0.0407,0.0327,0.0459,0.0742,0.0952,0.1115,0.1164,0.1186,0.1164,0.1146,0.1013,0.0925,0.0851,0.0761
Net Income / CAGR 5Y,0.0609,0.0825,0.0996,0.0992,0.0868,0.0763,0.0694,0.0529,0.0366,0.0286,0.0146,0.0354,0.0853,0.1267,0.1569,0.1733,0.1739,0.1639,0.156,0.135,0.1208,0.1105,0.1026
Normalized Net Income / CAGR 5Y,0.0505,0.0667,0.0777,0.0686,0.0577,0.0491,0.0421,0.0246,0.0068,-0.0017,-0.0154,0.0052,0.0553,0.0938,0.1222,0.1388,0.1412,0.1357,0.132,0.1122,0.1037,0.0965,0.093
The CSV data clearly indicates that `Normalized Price` is a row within the CSV file, not a column header. The data is in a transposed format where each row represents a different financial metric and each column after the first one represents a different fiscal quarter. Hence the data was transposed back to a format more conducive to machine learning in the earlier chapters.
Let's go through the steps to create a histogram of the `Normalized Price` feature from a CSV file using Pandas, starting from the beginning. Note that we are using the term 'feature' to refer to the financial metrics (which become columns once the data is transposed) and 'observation' to refer to the fiscal quarters (which become rows). `Normalized Price` is a feature, and each fiscal quarter is an observation. Although it will eventually become the target variable, it is still a feature in the CSV file. Don't get confused by the terminology.
To work with this data and plot a histogram of the `Normalized Price`, you would need to read the CSV file into a DataFrame, transpose it, and then access 'Normalized Price' as a column. Here's how you can do that:
import pandas as pd
import matplotlib.pyplot as plt
data_file = "./data/raw_data-aapl.csv"
image_name = "./data/normalized_price_histogram_pandas.png"
target_column = 'Normalized Price'
# Load the dataset and transpose it
df = pd.read_csv(data_file, index_col=0).transpose()
# Now 'Normalized Price' can be accessed as a column
normalized_price = df[target_column]
# Plot the histogram
plt.hist(normalized_price, bins=10, alpha=0.7, color='blue')
plt.title(f'Histogram of {target_column}')
plt.xlabel(target_column)
plt.ylabel('Frequency')
plt.grid(True)
# Save the plot
plt.savefig(image_name)
plt.show()
plt.close()
To ensure that a plot created with Matplotlib is saved correctly, it is recommended to call the `plt.savefig()` function before `plt.show()`. This is because `plt.show()` may sometimes clear the figure, depending on the backend Matplotlib uses, which could result in a blank image being saved if `plt.savefig()` is called afterward. To avoid this issue and ensure that the plot is saved with all its contents, save the figure to a file before calling `show()`.
The `plt.close()` function is used to close a Matplotlib figure window, which can help free up the memory that the figure was using. While not typically about preventing memory leaks in a scripting context, using `plt.close()` can be particularly important in interactive environments, such as a Python shell or Jupyter notebooks, where many figures may be created. In such cases, closing figures that are no longer needed can help manage system memory effectively. In scripts where figures are created in a loop, it is good practice to close each figure after use to avoid excessive memory use and resource warnings.
This code sets the first column as the index and then transposes the DataFrame so that the fiscal quarters become the index and the financial metrics, including `Normalized Price`, become the column headers. You can then plot the histogram for `Normalized Price` as you would with any other DataFrame column.
The code produces the following histogram:
Something seems amiss. However, before we address that, let's redo this using Seaborn. I simply utilized Matplotlib initially to demonstrate how to accomplish it without relying on any sophisticated libraries. We will be using Seaborn for the rest of the chapter.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data_file = "./data/raw_data-aapl.csv"
image_name = "./data/normalized_price_histogram_seaborn.png"
target_column = 'Normalized Price'
# Load the dataset and transpose it
df = pd.read_csv(data_file, index_col=0).transpose()
# Now 'Normalized Price' can be accessed as a column
normalized_price = df[target_column]
# Set the style of seaborn
sns.set(style="whitegrid")
# Plot the histogram using seaborn
sns.histplot(normalized_price, bins=10, alpha=0.7, color='blue', kde=False)
# Set the labels and title
plt.title(f'Histogram of {target_column}')
plt.xlabel(target_column)
plt.ylabel('Frequency')
# Save the plot
plt.savefig(image_name)
# Show the plot
plt.show()
# Close the plot
plt.close()
This code uses `sns.histplot`, which is the Seaborn equivalent of a histogram. Seaborn is built on top of Matplotlib and uses the same underlying plotting functions. Thus, you can save and show plots in the same way as Matplotlib. The `kde` parameter is set to `False` because we only want the histogram, not the Kernel Density Estimate, which is sometimes used in conjunction with histograms to estimate the probability density function. If you wish to include the KDE, you can simply set `kde=True`.
Additionally, Seaborn's `histplot` function has taken over from `distplot`, which was previously used for histograms in Seaborn but is now deprecated. `sns.set(style="whitegrid")` is used to set the aesthetic style of the plots. You can change this to other styles like `"darkgrid"`, `"ticks"`, etc., according to your preference.
Here's a dark grid version:

`sns.histplot()` is used to create a histogram of the `Normalized Price` data:

- The `bins=10` parameter specifies that the data should be divided into 10 bins.
- `alpha=0.7` sets the transparency level of the bins.
- `color='blue'` defines the color of the histogram bars.
- `kde=False` indicates that the Kernel Density Estimate (KDE) line, which can be used to show the probability density distribution of the data, should not be plotted.
The rest of the code is self-explanatory.
Let's take a moment to explore the concept of a normal distribution, a term that you might have encountered throughout this book and may be curious about.
Please do yourself a favor and read the following essay before proceeding further:
Normal-Distribution-As-A-Life-Hack.md
Investing your time reading the essay is going to pay off and give you some killer insights for a happier life.
Here's why a normal distribution matters so much in data analysis and machine learning:

- Model Assumptions: Many machine learning algorithms assume that the features are normally distributed. This is especially true for algorithms that are based on linear models, such as linear regression and logistic regression. If the data is normally distributed, it simplifies the mathematics behind these models and makes them more robust. We are dealing with a regression problem here, so it's important to understand the distribution of the target variable.
- Feature Scaling: When features are normally distributed, scaling methods like standardization (which subtracts the mean and divides by the standard deviation) maintain the shape of the distribution. This is important because many algorithms, like support vector machines and k-means clustering, perform better when the features are on similar scales. Performing these operations does not change the shape of the distribution, which is crucial for maintaining the integrity of the data.
- Algorithm Performance: Algorithms tend to perform well when the data is normally distributed because the prediction errors are also normally distributed, making it easier to manage and model the error terms. Normal data is, after all, easier to understand and model.
- Statistical Measures: In a normal distribution, the mean, median, and mode are all the same. This central tendency makes it easier to describe and interpret the data. For instance, with normally distributed errors, you can use statistical measures like the Z-score to understand the probability of a data point occurring within your data.
- Outlier Detection: The characteristics of a normal distribution allow for the identification of outliers. Data points that lie many standard deviations away from the mean can be considered outliers, which could either be due to variability in the data or potential errors.
- Error Minimization Techniques: Techniques like least squares, which are used to optimize many machine learning models, work under the assumption that the errors are normally distributed. This results in models that minimize the sum of the squares of the errors, which is a powerful method for finding the best fit in linear models.
- Confidence Intervals: Normal distributions allow for the construction of confidence intervals for predictions, which can be crucial for understanding the reliability of model predictions.
- Central Limit Theorem: This theorem states that the sampling distribution of the sample mean approaches a normal distribution, even if the population distribution is not normal, as long as the sample size is large enough. This is vital in machine learning for validating models using techniques like cross-validation.
In the context of machine learning, if the target variable, such as `Normalized Price`, follows a normal distribution, it may ease the process of modeling, as many algorithms perform better with this assumption met. However, modern machine learning techniques have also advanced to handle non-normal distributions effectively, often through data transformation or algorithms that do not assume normality.
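If you want a quick, rough check of that assumption rather than relying on a histogram alone, a normality test is one option. Here's a minimal sketch using SciPy's Shapiro-Wilk test on the same transposed DataFrame as before; note that SciPy is an extra dependency not otherwise used in this chapter, and the file path is the same one assumed in the earlier examples.

```python
import pandas as pd
from scipy import stats

data_file = "./data/raw_data-aapl.csv"

# Same loading/transposing steps as before
df = pd.read_csv(data_file, index_col=0).transpose()

# Shapiro-Wilk: the null hypothesis is that the data is normally distributed
statistic, p_value = stats.shapiro(df['Normalized Price'].astype(float))
print(f"Shapiro-Wilk statistic: {statistic:.4f}, p-value: {p_value:.4f}")

# A small p-value (commonly < 0.05) suggests the data deviates from normality
```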
Let's look at the histogram of the `Normalized Price` feature again:
What's happening here?
The histogram depicts the distribution of `Normalized Price` across different intervals or bins. Here's an analysis of the distribution and some common issues that might arise from it:
- Distribution Shape: The distribution is not symmetrical and shows a multimodal pattern, which means there are multiple peaks. This suggests that there are several intervals where the 'Normalized Price' is clustered, as opposed to a single, central tendency that a normal distribution would show.
- Peaks and Valleys: There are noticeable peaks around the 40-60 and 140-160 price ranges, and valleys (lower frequencies) in between these peaks. This could indicate that the data has several different groups or populations with distinct `Normalized Price` characteristics.
- Potential Outliers: The bins on the extremes of the distribution (the far left and far right) could potentially contain outliers if they represent prices that are significantly different from the rest of the data.
- Common Problems:
  - Non-Normality: Many statistical tests and machine learning algorithms assume that the data follows a normal distribution. The clear deviation from normality in this distribution could pose problems if those methods are used without adjustments or transformations.
  - Multiple Modes: The presence of multiple modes (peaks) could complicate the analysis, as it suggests the potential combination of different populations within the dataset. Understanding the source of these modes is essential for accurate analysis.
  - Unequal Class Intervals: If the bins do not represent equal intervals of `Normalized Price`, it might give a misleading representation of the data distribution. It's important to ensure that each bin covers the same range of prices.
  - Sample Size: Depending on the sample size, the histogram might not accurately reflect the population's distribution. A larger sample size could provide a more reliable representation.
  - Skewness: The distribution appears to be skewed, which could be problematic for certain parametric tests that require symmetry.
- To address these issues, we could:
  - Apply Data Transformation: To make the data more normally distributed, transformations like the logarithm and square root can be used.
  - Investigate Subgroups: Analyzing the dataset to understand why multiple modes are present could provide insights and lead to more targeted analysis.
  - Use Non-Parametric Methods: If normality cannot be achieved, non-parametric statistical methods that do not assume a normal distribution can be considered.
Understanding the distribution is key for making informed decisions about further analysis and potential data preprocessing steps.
However, here's a caveat. We're dealing with a time series dataset, and the distribution of the target variable `Normalized Price` is likely to change over time. This is because the stock market is dynamic and the price of a stock can change rapidly. In the context of time series data, such as the `Normalized Price` over fiscal quarters, we typically would not apply the same methods to achieve a normal distribution as we would for cross-sectional data. This is because time series data are sequential and often have temporal dependencies, meaning that observations are correlated with past values. This violates the assumption of independence that underlies many statistical techniques designed for cross-sectional data.
In time series analysis, we are often more interested in understanding trends, seasonal effects, and cycles, as well as forecasting future values, rather than fitting the data to a normal distribution. Here are some points to consider:
- Trends: The data may exhibit a trend over time, which could be upward or downward. In the provided data, there seems to be a general upward trend in the `Normalized Price`.
- Seasonality: There may be seasonal effects, where patterns repeat at regular intervals. This is common in financial data due to quarterly reporting, fiscal policies, and market cycles. A business might see the bulk of its products flying off the shelves during the holiday rush, or you could notice the stock market riding a wild rollercoaster at specific times annually.
- Autocorrelation: Observations in time series data can be correlated with previous time points (autocorrelation). This needs to be accounted for in the analysis and modeling.
- Non-Stationarity: Time series data often exhibit non-stationarity, meaning that their statistical properties change over time. Techniques like differencing or transformation (e.g., logarithmic) are used to stabilize the mean and variance; a sketch of this follows below.
- Volatility Clustering: In financial time series, periods of high volatility often cluster together. Advanced models can be used to model and forecast this kind of volatility.
In summary, while the normal distribution is a fundamental concept in statistics, its direct application is more suited to independent and identically distributed (i.i.d.) data rather than time series data. Time series analysis requires a different set of tools and techniques to account for the temporal structure of the data.
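To ground the non-stationarity point, here's a minimal sketch of one common transformation for a trending price series; it reuses the same file path assumed in the earlier examples, and quarter-over-quarter percentage change is just one option among several.

```python
import pandas as pd

data_file = "./data/raw_data-aapl.csv"

df = pd.read_csv(data_file, index_col=0).transpose()
prices = df['Normalized Price'].astype(float)

# Quarter-over-quarter percentage change: one common way to tame a trending series
returns = prices.pct_change().dropna()

print("Raw price summary:")
print(prices.describe())
print("\nQuarter-over-quarter return summary:")
print(returns.describe())
```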
How about we check out a straightforward example where we normalize a distribution that's not normal to begin with?
In the histograms above, we see two distributions. The first histogram on the left displays a non-normal distribution, which in this case is an exponential distribution. This type of distribution is characterized by a rapid decrease in frequency as the value increases, showing a skew to the right. It's a common pattern in real-world phenomena where large occurrences are rare, but small ones are frequent.
The second histogram on the right illustrates the distribution after applying a logarithmic transformation to the data. The transformation has modified the scale of the data, pulling in the long tail and stretching out the dense area near the origin. The result is a distribution that appears more bell-shaped and symmetrical, resembling a normal distribution more closely.
The logarithmic transformation is a common technique used to normalize data, especially for right-skewed distributions like the exponential. By reducing skewness, the transformed data may meet the assumptions of various statistical methods better, such as linear regression, which assumes that the residuals (differences between observed and predicted values) are normally distributed.
In machine learning, you'll bump into plenty of log-related transformations: there's log loss, logit models, log-likelihood, log-odds, to name a few. The humble logarithm is a mighty little thing, really handy for reshaping data. It's a go-to move for making data more normal-looking, cutting down on lopsidedness, and getting that variation to chill out. So when 'log' pops up, think normalization, just like you think scaling whenever division rolls into the equation.
However, it's essential to note that not all non-normal distributions can be transformed into a normal distribution, and not all transformations will result in a perfectly normal distribution. Additionally, the choice of transformation depends on the specific characteristics of the data and the requirements of the subsequent analysis or modeling task.
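Here's a minimal sketch of the kind of transformation shown in the histograms above, using synthetic exponential data; the distribution parameters and bin counts are arbitrary choices for illustration.

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Right-skewed, non-normal data
exponential_data = rng.exponential(scale=2.0, size=1000)

# Log transform (log1p handles values at or near zero gracefully)
log_data = np.log1p(exponential_data)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.histplot(exponential_data, bins=30, kde=True, ax=axes[0], color='blue')
axes[0].set_title('Original (Exponential) Distribution')
sns.histplot(log_data, bins=30, kde=True, ax=axes[1], color='green')
axes[1].set_title('After Log Transformation')
plt.tight_layout()
plt.show()
```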
Let's revisit the Yahoo Finance package for a moment. We'll use it for simplicity's sake. After all, you can get current financial data on a selected set of companies by using `yfinance`. Let's pull `ROE` (Return on Equity) and `Trailing PE` (Price to Earnings) for the 12 tech companies we are looking at here.
import yfinance as yf
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Define the ticker symbols
tickers = ["AAPL", "MSFT", "AMZN", "TSLA", "GOOGL", "META", "NVDA", "INTC", "AMD", "ADBE", "NFLX", "AVGO"]

# Initialize an empty DataFrame for stock data
stock_data = pd.DataFrame()

for ticker in tickers:
    stock = yf.Ticker(ticker)
    # Get current info
    info = stock.info
    selected_info = {
        'Ticker': ticker,
        'ReturnOnEquity': info.get('returnOnEquity'),
        'TrailingPE': info.get('trailingPE')
    }
    # Create a DataFrame from the selected info
    ticker_df = pd.DataFrame([selected_info])
    # Concatenate the new DataFrame with the existing one
    stock_data = pd.concat([stock_data, ticker_df], ignore_index=True)

# Remove rows with NaN and re-number the remaining rows
stock_data.dropna(inplace=True)
stock_data.reset_index(drop=True, inplace=True)

# Display the DataFrame
print(stock_data)

# Setting the aesthetic style of the plots
sns.set(style="darkgrid")

# Plotting the histogram
plt.figure(figsize=(10, 6))
sns.histplot(stock_data['ReturnOnEquity'], kde=True, color='blue')

# Adding title and labels
plt.title('Histogram of Return on Equity (ROE) for Selected Companies')
plt.xlabel('Return on Equity')
plt.ylabel('Frequency')

# Show the plot
plt.show()
Ticker ReturnOnEquity TrailingPE
0 AAPL 1.71950 30.280130
1 MSFT 0.39107 37.606003
2 AMZN 0.12531 80.531250
3 TSLA 0.22460 70.382640
4 GOOGL 0.25334 27.327585
5 META 0.22275 32.994713
6 NVDA 0.69173 71.986840
7 AMD 0.00380 1221.333400
8 ADBE 0.35513 50.511430
9 NFLX 0.21228 49.019920
10 AVGO 0.60312 33.566063
Easy-peasy. Note that we are dealing with `NaN` values here. We can't have that. So we drop them.
# Remove rows with NaN
stock_data.dropna(inplace=True)
If we don't drop the `NaN` values, we get this:
Ticker ReturnOnEquity TrailingPE
0 AAPL 1.71950 30.280130
1 MSFT 0.39107 37.606003
2 AMZN 0.12531 80.531250
3 TSLA 0.22460 70.382640
4 GOOGL 0.25334 27.327585
5 META 0.22275 32.994713
6 NVDA 0.69173 71.986840
7 INTC -0.01601 NaN
8 AMD 0.00380 1221.333400
9 ADBE 0.35513 50.511430
10 NFLX 0.21228 49.019920
11 AVGO 0.60312 33.566063
Intel's Trailing PE is `NaN`. We don't want that, so we drop it.
Since we are dealing with a small set of data, we can easily spot unwieldy outliers. We notice that AMD's Trailing PE is 1221.333400. That's a significant outlier. And Apple has an ROE of 1.71950, which translates to 172%, an almost unheard-of figure in the real world. This highlights the importance of using reliable data sources. We can't just pull data from anywhere; we need to exercise caution. The reasons behind these outliers are uncertain; they could be due to errors in the data or the unique nature of the businesses. We can't be sure.
Let's break down the key points and the actions taken:
- Handling Missing Data (Intel's Trailing PE): The `NaN` value in Intel's `TrailingPE` is a case of missing data. In data analysis, especially with a small dataset, each data point is valuable. However, missing data can skew analysis and lead to inaccurate conclusions. Therefore, dropping rows with missing data, particularly when the dataset is small and the missing data is non-trivial, is a standard approach.
- Identifying and Addressing Outliers:
  - AMD's Trailing PE: The extremely high `TrailingPE` for AMD stands out as an outlier. In financial analysis, such outliers can distort average values and other statistical measures, leading to misleading interpretations. In this case, further investigation into why this value is so high is warranted. If it's an error, correction or removal might be necessary. If it's accurate, understanding the context becomes important.
  - Apple's ROE: An ROE of 171.95% is highly unusual. ROE, or Return on Equity, measures a corporation's profitability by revealing how much profit a company generates with the money shareholders have invested. Extremely high ROE values might indicate exceptional company performance, but they are often a red flag for data issues or anomalies that require further exploration.
- Reliability of Data Sources: It is crucial to realize the importance of using reliable data sources. Financial data can vary across different platforms. Discrepancies might arise due to differences in how metrics are calculated, the timing of data updates, or outright errors. When performing financial analysis, verifying and cross-referencing data from multiple reputable sources is a good practice to ensure accuracy.
- Uncertainty Behind Outliers: Outliers can be due to a variety of reasons – data errors, exceptional company-specific events (like a one-time gain or loss), or unique business models that lead to atypical financial metrics. Understanding the cause of outliers is as important as identifying them, as it informs how to treat these data points in your analysis.
Our approach to carefully scrutinize the data, identify and address anomalies, and recognize the importance of reliable data sourcing is a solid foundation for sound financial analysis.
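With only a dozen rows we can spot these anomalies by eye, but on a larger table you'd want a programmatic check. Here's a minimal sketch using the common interquartile-range (IQR) rule; it assumes the `stock_data` DataFrame built in the snippet above, and the 1.5 multiplier is just the conventional default, not a magic number.

```python
import pandas as pd

def flag_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return rows whose value in `column` falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] < lower) | (df[column] > upper)]

# Assumes stock_data from the earlier snippet is in scope
print(flag_iqr_outliers(stock_data, 'TrailingPE'))      # should surface AMD's extreme PE
print(flag_iqr_outliers(stock_data, 'ReturnOnEquity'))  # should surface Apple's extreme ROE
```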
Here's the histogram of the `ReturnOnEquity` feature:
We can easily create the histogram for the `TrailingPE` feature as well.
plt.figure(figsize=(10, 6))
sns.histplot(stock_data['TrailingPE'], kde=True, color='blue')
plt.title('Histogram of Trailing PE for Selected Companies')
plt.xlabel('Trailing PE')
plt.ylabel('Frequency')
plt.show()
Analyzing the histograms for both Return on Equity (ROE) and Trailing Price-to-Earnings (PE) ratios, it's clear that there are outliers that could potentially skew the analysis.
For ROE Histogram:
- There is a data point that stands far apart from the rest, which is Apple's ROE at 171.95%. In financial analysis, such an extreme value could be due to extraordinary items or an error in the data.
For Trailing PE Histogram:
- AMD shows an unusually high Trailing PE ratio of 1221.33, which significantly deviates from the rest of the data points.
Suggested Solutions:
- Verification: Before taking any action, verify the accuracy of these data points. Check the source of your data or look for additional sources to confirm the figures.
- Outlier Treatment: If the data points are indeed errors, they should be corrected if the correct values are known, or excluded from the dataset if they are not. If they are correct but result from one-time events that are not expected to recur, you might still consider removing them from the dataset for analysis purposes.
- Data Transformation: For data points that are accurate but unusually high or low, you can apply a data transformation. Logarithmic transformations are commonly used to reduce the impact of outliers on the analysis.
- Robust Statistical Methods: Use statistical methods that are less sensitive to outliers. For instance, the median is more robust than the mean, and median-based measures can be used for central tendency and dispersion.
- Separate Analysis: Conduct a separate analysis for outliers to understand their impact on the dataset. This can provide insights into why these companies have such different financial ratios.
- Capping: Implement a capping method where you set a threshold, and all values beyond that threshold are set to the threshold value. For instance, you might cap the Trailing PE ratio at a certain percentile (see the sketch below).
- Report Separately: In reporting your findings, you can mention the outliers separately to acknowledge their presence and the potential reasons behind them without letting them unduly influence the overall analysis.
By carefully addressing these outliers, we can ensure a more accurate and representative analysis of the data at hand.
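To make the capping idea concrete, here's a minimal sketch that clips `TrailingPE` at its 5th and 95th percentiles (often called winsorizing). The percentile choice is arbitrary and purely illustrative, and the snippet assumes the `stock_data` DataFrame from earlier.

```python
import pandas as pd

# Assumes stock_data from the earlier snippet is in scope
lower = stock_data['TrailingPE'].quantile(0.05)
upper = stock_data['TrailingPE'].quantile(0.95)

# clip() replaces values below/above the thresholds with the thresholds themselves
stock_data['TrailingPE_Capped'] = stock_data['TrailingPE'].clip(lower=lower, upper=upper)

print(stock_data[['Ticker', 'TrailingPE', 'TrailingPE_Capped']])
```

Whether you cap, drop, or report an outlier separately depends on what you plan to do with the data downstream.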
Just a note on high Trailing PEs. Companies undergoing a turnaround may exhibit high Trailing PEs. This is because their past earnings (the denominator in the PE ratio) are low or negative due to previous difficulties. As the company recovers and earnings improve, the PE ratio can normalize. For example, a Trailing PE of 100 implies that, at its current level of earnings, it would take the company 100 years to earn back its market capitalization, which is a long time. A genuine turnaround company, however, might manage to earn back its market capitalization in just a few years as earnings rebound. Understanding the context is crucial.
If a company loses a significant amount of money in a year, it might have a negative Trailing PE, but that's not necessarily a bad thing; if it starts making significant profits again, negative or extreme Trailing PEs can reverse. The stock price is the numerator in the PE formula, while earnings are the denominator. If earnings are negative, the PE is negative; if earnings are positive, the PE is positive; if earnings are zero, the PE is undefined. The smaller the (positive) earnings, the higher the PE. But what if the company starts earning substantially again? Then the denominator grows rapidly, and the resulting PE shrinks.
PE_Ratio = Stock_Price / Earnings
It's a simple arithmetic formula at work here. Understanding the context is key.
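To see that arithmetic in action, here's a toy calculation with a made-up stock price and a range of hypothetical earnings-per-share figures:

```python
# Hypothetical numbers, purely for illustration
stock_price = 100.0
for eps in [-2.0, 0.5, 1.0, 5.0, 10.0]:
    pe = stock_price / eps
    print(f"EPS = {eps:>5}:  PE = {pe:>7.1f}")

# Tiny positive earnings produce a huge PE, recovering earnings shrink it rapidly,
# and negative earnings flip the sign (zero earnings would leave the PE undefined).
```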
A high Trailing PE can also indicate that investors expect higher earnings in the future. They may be willing to pay a premium for the stock now, anticipating growth that will justify the high PE ratio later.
When you encounter a high Trailing PE, it's crucial to dig deeper into the company's financials, consider its earnings outlook, and understand the industry trends and the economic environment in which the company operates. This holistic approach will provide a more accurate picture of the company's valuation and future prospects.
To normalize the Return on Equity (ROE) and Trailing Price-to-Earnings (PE) ratios for the companies in our dataset, we can use several statistical techniques. Normalization helps to bring different variables to a similar scale, allowing for better comparison and integration into certain analytical models.
The first method, Z-score normalization (also called standardization), involves transforming the data into a distribution with a mean of 0 and a standard deviation of 1. The formula for the Z-score is:
Z = (X - μ) / σ
Where:
- `X` is the original value.
- `μ` is the mean of the dataset.
- `σ` is the standard deviation of the dataset.
The second method, Min-Max scaling, involves scaling the feature to a fixed range, usually 0 to 1. The formula is:
X_normalized = (X - X_min) / (X_max - X_min)
Where:
- `X` is the original value.
- `X_min` is the minimum value in the dataset.
- `X_max` is the maximum value in the dataset.
Here's a simple Python code snippet to apply Z-score normalization and Min-Max scaling to the ROE and Trailing PE columns:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd

# Sample data
data = {
    'Ticker': ['AAPL', 'MSFT', 'AMZN', 'TSLA', 'GOOGL', 'META', 'NVDA', 'AMD', 'ADBE', 'NFLX', 'AVGO'],
    'ReturnOnEquity': [1.71950, 0.39107, 0.12531, 0.22460, 0.25334, 0.22275, 0.69173, 0.00380, 0.35513, 0.21228, 0.60312],
    'TrailingPE': [30.280130, 37.606003, 80.531250, 70.382640, 27.327585, 32.994713, 71.986840, 1221.333400, 50.511430, 49.019920, 33.566063]
}

stock_data = pd.DataFrame(data)

# Remove the outlier for normalization (copy() avoids chained-assignment warnings)
stock_data_filtered = stock_data[stock_data['Ticker'] != 'AMD'].copy()

# Z-Score Normalization
scaler = StandardScaler()
stock_data_filtered[['ROE_Z_Score', 'PE_Z_Score']] = scaler.fit_transform(stock_data_filtered[['ReturnOnEquity', 'TrailingPE']])

# Min-Max Scaling
min_max_scaler = MinMaxScaler()
stock_data_filtered[['ROE_MinMax', 'PE_MinMax']] = min_max_scaler.fit_transform(stock_data_filtered[['ReturnOnEquity', 'TrailingPE']])

print(stock_data_filtered)
In this code:
- We first filter out the outlier (AMD) to prevent it from skewing our normalization.
- We use `StandardScaler` and `MinMaxScaler` from `sklearn.preprocessing` to perform the Z-score normalization and Min-Max scaling, respectively.
- The `fit_transform` method computes the necessary statistics (mean and standard deviation for Z-score; min and max for Min-Max) and scales the data accordingly.
This will add four new columns to the DataFrame: `ROE_Z_Score`, `PE_Z_Score`, `ROE_MinMax`, and `PE_MinMax`, which represent the normalized values for ROE and Trailing PE using both methods.
Remember, normalization is context-dependent. If we're analyzing time-series data or data that needs to maintain its structure over time (like financial ratios), normalization should be done carefully, considering the implications for analysis and interpretation. It is also essential to note that normalization won't fix the underlying issues with data quality or outliers; those need to be addressed separately before or along with normalization.
Now let's apply these concepts to our example code and recreate the histograms.
import yfinance as yf
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
# Define the ticker symbols
tickers = ["AAPL", "MSFT", "AMZN", "TSLA", "GOOGL", "META", "NVDA", "INTC", "AMD", "ADBE", "NFLX", "AVGO"]
# Initialize an empty DataFrame for stock data
stock_data = pd.DataFrame()
for ticker in tickers:
    stock = yf.Ticker(ticker)
    # Get current info
    info = stock.info
    selected_info = {
        'Ticker': ticker,
        'ReturnOnEquity': info.get('returnOnEquity'),
        'TrailingPE': info.get('trailingPE')
    }
    # Create a DataFrame from the selected info
    ticker_df = pd.DataFrame([selected_info])
    # Concatenate the new DataFrame with the existing one
    stock_data = pd.concat([stock_data, ticker_df], ignore_index=True)
# Remove rows with NaN
stock_data.dropna(inplace=True)
# Normalize the 'ReturnOnEquity' and 'TrailingPE' columns
scaler = StandardScaler()
stock_data[['ReturnOnEquity', 'TrailingPE']] = scaler.fit_transform(
stock_data[['ReturnOnEquity', 'TrailingPE']])
# Display the DataFrame
print(stock_data)
# Setting the aesthetic style of the plots
sns.set(style="darkgrid")
# Plotting the histogram for normalized Return on Equity (ROE)
plt.figure(figsize=(10, 6))
sns.histplot(stock_data['ReturnOnEquity'], kde=True, color='blue')
plt.title('Normalized Histogram of Return on Equity (ROE) for Selected Companies')
plt.xlabel('Normalized Return on Equity')
plt.ylabel('Frequency')
plt.show()
# Plotting the histogram for normalized Trailing PE
plt.figure(figsize=(10, 6))
sns.histplot(stock_data['TrailingPE'], kde=True, color='blue')
plt.title('Normalized Histogram of Trailing PE for Selected Companies')
plt.xlabel('Normalized Trailing PE')
plt.ylabel('Frequency')
plt.show()
However, even after normalization, the distributions of the ROE and Trailing PE ratios are still not normal. This is because the data is not normally distributed to begin with. Normalization does not change the shape of the distribution; it only changes the scale. Therefore, it's important to understand the distribution of the data before applying normalization techniques.
Understanding the difference between normalization and creating normal distributions is a fundamental concept in data analysis.
Normalization is a scaling technique in data preprocessing. The goal of normalization is to change the values of numeric columns in a dataset to a common scale, without distorting differences in the ranges of values. This is particularly useful when your data has different units or scales. For example, if you're comparing test scores (ranging from 0 to 100) and income (ranging from thousands to millions), normalization helps to bring these on a comparable scale.
There are several methods of normalization as explained earlier.
Making data fit a normal distribution typically involves transformations because you are altering the shape of the distribution itself, not just scaling the range of values. The aim here is to mold the data into that bell curve shape, which is a requirement for some statistical tests and analyses that assume normality.
The key differences are:
- Normalization: Scaling data to a range; doesn't change the shape of the distribution. It's about changing the "size" of the data.
- Normal Distribution: A specific shape of data distribution; to "make data normal" involves transforming data to fit this shape. It's about changing the "shape" of the data.
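A quick way to convince yourself of this "size versus shape" distinction is to measure skewness before and after each operation. Here's a minimal sketch with made-up, right-skewed data; scaling leaves the skew untouched, while a log transformation changes it:

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)  # right-skewed sample

z = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()  # normalization: rescaling only
logged = np.log(x)                                            # transformation: reshaping

print(f"skew(original):   {skew(x):.2f}")       # strongly positive
print(f"skew(normalized): {skew(z):.2f}")       # same as the original; only the scale changed
print(f"skew(logged):     {skew(logged):.2f}")  # much smaller in magnitude; the shape changed
```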
In practice:
- If your data analysis technique requires data to be on the same scale, use normalization.
- If your data analysis technique assumes a normal distribution (like certain parametric statistical tests), and your data is not normally distributed, consider a data transformation.
It's crucial to choose the right method based on the requirements of your analysis or the algorithms you plan to use. Not all methods will require normally distributed data, and sometimes, attempting to force a normal distribution can distort the real-world relationships present in your data.
In the case of financial ratios like Return on Equity (ROE) and Trailing Price-to-Earnings (PE) ratios, whether you should normalize the data or transform it to fit a normal distribution depends on the purpose of your analysis:
- Comparative Analysis: If your primary goal is to compare these financial ratios across different companies, normalization (such as Z-score standardization or Min-Max scaling) is appropriate. It puts all companies on a level playing field by adjusting for the scale of the data, allowing for direct comparisons even when the absolute values of ROE or PE are quite different.
- Statistical Testing: If you plan to perform statistical tests that assume normality (like t-tests or ANOVAs), and your data is not normally distributed, you might consider transforming the data to fit a normal distribution. However, it's essential to note that many financial data sets do not follow a normal distribution due to the nature of the financial markets and company performance.
For financial ratios, here are some considerations:
- ROE: Since ROE is a profitability ratio, it can vary widely across different sectors and industries. Normalization can be helpful here to compare companies' performance relative to each other. A Z-score normalization could be particularly useful since it retains the sign of the original data (which can indicate whether the ROE is positive or negative).
- Trailing PE: PE ratios are heavily influenced by market sentiment, future earnings expectations, and industry factors. High PE ratios can indicate growth expectations or overvaluation, while low PE ratios can suggest undervaluation or declining performance. In such cases, normalization can help compare PE ratios across different companies. However, extreme outliers should be handled carefully, as they can represent unique situations that might not be comparable.
In most cases for financial analysis, especially when dealing with ratios, normalization is the more common approach because financial analysts are typically more interested in relative comparisons rather than the absolute fit of the data to a specific distribution. However, always consider the context and the specific analyses you will perform before deciding. If you're using machine learning models or conducting analyses that do not assume normality, normalization without trying to fit a normal distribution might be entirely sufficient.
We're opting for normalization here, but just for the fun of it, let's try to fit a normal distribution to the data.
Fitting data to a normal distribution when it's not naturally normal can be challenging, especially with financial ratios that often have skewed distributions. However, for the sake of exploration, we can attempt to use transformations to see if we can get closer to a normal distribution.
Some common transformations that can help in making the distribution of data more normal are:
- Log Transformation: Useful when data spans several orders of magnitude and has a right-skewed distribution.
- Square Root Transformation: Can help with right-skewed distributions, though not as strongly as log transformations.
- Box-Cox Transformation: A more generalized transformation that can handle both right and left-skewed data.
A log transformation is a common technique used to deal with skewed data in an attempt to approximate a normal distribution, especially for data that spans several orders of magnitude. It works well for data that is positive and right-skewed.
Here is how you would apply a log transformation to the `ReturnOnEquity` and `TrailingPE` columns in your dataset, and then plot histograms of the transformed data:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Assuming stock_data is your DataFrame with ROE and Trailing PE data
# and it has been cleaned of NaNs and infinities
# Apply log transformation to the positive values only
# Adding a small constant to shift any zero values in the data
stock_data['ReturnOnEquity_Log'] = np.log(stock_data['ReturnOnEquity'] + 1e-9)
stock_data['TrailingPE_Log'] = np.log(stock_data['TrailingPE'] + 1e-9)
# Plot the transformed ROE
plt.figure(figsize=(10, 6))
sns.histplot(stock_data['ReturnOnEquity_Log'], kde=True, color='blue')
plt.title('Log Transformed Histogram of Return on Equity (ROE) for Selected Companies')
plt.xlabel('Log Transformed Return on Equity')
plt.ylabel('Frequency')
plt.show()
# Plot the transformed Trailing PE
plt.figure(figsize=(10, 6))
sns.histplot(stock_data['TrailingPE_Log'], kde=True, color='blue')
plt.title('Log Transformed Histogram of Trailing PE for Selected Companies')
plt.xlabel('Log Transformed Trailing PE')
plt.ylabel('Frequency')
plt.show()
In this code:
- We add a small constant (`1e-9`) to the data before applying the log transformation to avoid issues with log(0).
- We use NumPy's `log` function to apply the log transformation.
- We plot histograms of the transformed data to check if the log transformation helped in making the data distribution more normal.
It's important to note that the log transformation is most effective for data that does not include negative values, as the logarithm of a negative number is undefined. As explained above, PE ratios can be negative, so this transformation might not be appropriate for a dataset that includes negative PE ratios.
Additionally, while the log transformation can improve the normality of a distribution, it may not result in a perfectly normal distribution, especially if the original data is highly skewed or contains outliers.
Here's another code snippet that applies a Box-Cox transformation to the ROE and Trailing PE data and then plots the result to see if the distribution appears more normal:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
# Assuming stock_data is your DataFrame with ROE and Trailing PE data
# and it has been cleaned of NaNs and infinities
# Apply a Box-Cox transformation to the positive values only
# Box-Cox requires strictly positive inputs; a small constant shifts any zeros
# (negative values would need a larger shift or a different transformation)
stock_data['ReturnOnEquity'] += 1e-9  # To handle 0 values, if any
stock_data['TrailingPE'] += 1e-9      # To handle 0 values, if any
roe_transformed, _ = stats.boxcox(stock_data['ReturnOnEquity'])
pe_transformed, _ = stats.boxcox(stock_data['TrailingPE'])
# Plot the transformed ROE
plt.figure(figsize=(10, 6))
sns.histplot(roe_transformed, kde=True, color='blue')
plt.title('Box-Cox Transformed Histogram of Return on Equity (ROE) for Selected Companies')
plt.xlabel('Box-Cox Transformed Return on Equity')
plt.ylabel('Frequency')
plt.show()
# Plot the transformed Trailing PE
plt.figure(figsize=(10, 6))
sns.histplot(pe_transformed, kde=True, color='blue')
plt.title('Box-Cox Transformed Histogram of Trailing PE for Selected Companies')
plt.xlabel('Box-Cox Transformed Trailing PE')
plt.ylabel('Frequency')
plt.show()
In this code:
- We first add a very small constant to the ROE and Trailing PE data to avoid zeros, which the Box-Cox transformation cannot handle (it requires strictly positive values).
- We use the `boxcox` function from `scipy.stats` to apply the Box-Cox transformation to the data.
- We plot histograms of the transformed data to visually assess the normality.
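As a side note: if you genuinely need a Box-Cox-style transformation for data containing zeros or negative values (negative PE ratios, say), the Yeo-Johnson transformation is a closely related option that accepts them. Here's a minimal, standalone sketch using SciPy with made-up numbers; it's not part of the pipeline above.

```python
import numpy as np
from scipy import stats

# Made-up values, including negatives that Box-Cox would reject
values = np.array([-15.0, -2.0, 0.0, 3.5, 27.3, 49.0, 70.4, 1221.3])

# yeojohnson() estimates the transformation parameter (lambda) and returns the transformed data
transformed, lmbda = stats.yeojohnson(values)
print(f"estimated lambda: {lmbda:.3f}")
print(np.round(transformed, 3))
```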
Keep in mind, these transformations are contrived examples to illustrate the concept of fitting data to a normal distribution. Financial data may have inherent characteristics that make it inappropriate to force into a normal distribution for actual analysis and decision-making.
Recall that we already touched on correlation analysis in Chapter 4. Let's revisit it here.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the data
file_path = './data/raw_data-aapl.csv'
data = pd.read_csv(file_path)
# Display the first few rows of the DataFrame to ensure it's loaded correctly
print(data.head())
# Convert the 'Fiscal Quarters' row to header and transpose the DataFrame
data = data.set_index('Fiscal Quarters').transpose()
# Convert all columns to numeric, errors='coerce' will replace non-numeric values with NaN
data = data.apply(pd.to_numeric, errors='coerce')
# Drop any non-numeric columns that could not be converted
data = data.dropna(axis=1, how='all')
# Calculate the correlation matrix
correlation_matrix = data.corr()
# Visualize the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix for Financial Metrics')
plt.xticks(rotation=45) # Rotate the x labels for better readability
plt.yticks(rotation=0) # Keep the y labels horizontal
plt.show()
Let's go over some important points:
data = data.set_index('Fiscal Quarters').transpose()
- The data is originally in a wide format where each fiscal quarter is a column.
- `set_index('Fiscal Quarters')` sets the 'Fiscal Quarters' column as the DataFrame index.
- `transpose()` switches the rows and columns. After transposition, each row represents a fiscal quarter, and each column represents a different financial metric.
data = data.apply(pd.to_numeric, errors='coerce')
- This code converts all values in the DataFrame to numeric types. Financial data often contains numbers in formats that pandas does not automatically recognize as numeric (like formatted strings).
- `errors='coerce'` forces any values that can't be turned into numbers to become `NaN` (Not a Number). This is useful for handling non-numeric entries safely.
data = data.dropna(axis=1, how='all')
- After conversion, some columns might consist entirely of `NaN` values (especially if they were non-numeric), so this line removes any columns where all values are `NaN`.
correlation_matrix = data.corr()
- This line calculates the correlation matrix. In a correlation matrix, each element represents the correlation coefficient between two variables.
- The correlation coefficient is a measure of how much two variables change together. It ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation). A value of 0 indicates no correlation.
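As a quick refresher (the details live in Chapter 4), the Pearson coefficient is just the covariance of the two variables divided by the product of their standard deviations. A tiny sketch with made-up numbers, checked against NumPy's built-in:

```python
import numpy as np

# Made-up paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson r computed from deviations around the means
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

print(round(r_manual, 4))                 # close to 1: the two series rise together
print(round(np.corrcoef(x, y)[0, 1], 4))  # the same value via NumPy's built-in
```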
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix for Financial Metrics')
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.show()
- `sns.heatmap` is used to create a heatmap of the correlation matrix.
- `annot=True` displays the correlation coefficients in the heatmap.
- `cmap='coolwarm'` sets the color scheme, and `center=0` centers the colormap at zero.
- `plt.xticks(rotation=45)` and `plt.yticks(rotation=0)` rotate the x and y axis labels for better readability.
- `plt.title` adds a title to the heatmap.
- `plt.show()` displays the plot.
This script thus takes financial data, reshapes it, cleans it, calculates correlations among various metrics, and visualizes these correlations in an easy-to-understand heatmap format.
The heatmap represents a Pearson correlation matrix. The Pearson correlation coefficient is a statistical measure of the strength of the linear relationship between two variables. We already explored this concept in Chapter 4. Refer to that chapter for more details.
Wow, we've covered a lot of ground. Let's take a moment to recap what we've learned.
As we close the chapter on data analysis, we've journeyed through the essential steps of turning raw data into meaningful insights—a process integral to any AI-driven endeavor. We began with data preparation, the meticulous act of cleaning and organizing data, which sets the foundation for robust analysis. Like any skilled chef who carefully selects and prepares their ingredients before cooking, we too have learned to curate our data with precision.
We've traversed the landscape of summary statistics, capturing the essence of our data with measures of central tendency and dispersion. These statistics are the seasoning that brings out the underlying flavors of our data, allowing us to understand its core characteristics at a glance.
Our foray into data visualization equipped us with the tools to transform numbers into narratives, using charts and graphs that unveil patterns and trends which might otherwise remain hidden in a sea of numbers. Through visual storytelling, we've seen how complex information can be made accessible and engaging.
Correlation analysis illuminated the relationships between variables, revealing how they move in tandem or diverge. This understanding is crucial when constructing models that predict or infer from one metric to another, helping us build on the interconnectivity of the data elements.
Finally, data transformation techniques have shown us how to reshape and scale our data, making it suitable for analysis by various algorithms. Much like how ingredients can be cut to fit a recipe, we've adapted our data to meet the needs of different analytical methods.
Throughout this exploration, we've emphasized the importance of the analyst's role—your role—in guiding the AI. Tenny, our AI assistant, is a powerful tool, but it's your expertise, judgment, and continuous engagement that steer the course of analysis. As we've seen, the synergy between analyst and AI is what creates a truly effective data workflow.
By embracing active participation, not only do we prevent our minds from becoming idle, but we also refine our skills, keeping our analytical blades sharp and ready for the ever-evolving challenges of AI and data science.
With the conclusion of this chapter, you're now better equipped to lead Tenny through the intricate dance of data analysis. Remember, it's your hands that guide the tools and your vision that shapes the outcome. Let this chapter be a stepping stone to more advanced analyses and a reminder of the vibrant role you play in the AI kitchen.