Numbers in Computers
Understanding how computers represent and handle numbers under the hood will save you from many unexpected issues in your analysis work.
Broadly speaking, computers have 2 ways to represent numbers: integers (for whole numbers) and floating point numbers (for decimal numbers).
Regardless of the type, all numbers are stored in bits in the computer memory.
The following explanations are very good, and there’s no reason to write my own when these exist.
Integers
Integers, as a datatype, are wonderful. They are precise and pretty intuitive. However, they have a major pitfall: integer overflow and underflow.
Watch this video
Floating Point Numbers
Integers can only represent whole numbers, and one could argue that fixed-size integers can't represent really big numbers either. So how do we deal with decimals and really big numbers? Floating point numbers!
To learn about floating point numbers, please:
Watch this video
and Read this
Two major problems you run into when working with numbers in computers are:
- Integer overflow
  - Leads to situations like adding two big positive numbers and getting a negative number.
  - Leads to situations like adding two big negative numbers and getting a positive number.
- Floating point precision
  - Computers can't keep track of very large numbers and very small ones at the same time (in the same calculation).
  - Situations where very simple arithmetic gives weird results.
For example:
# Example 1: imprecise arithmetic
0.1 + 0.1 + 0.1
# Example 1.1: check whether 0.1 + 0.1 + 0.1 equals 0.3
0.1 + 0.1 + 0.1 == 0.3
# Example 2: very big and very small numbers can't be tracked in the same calculation, which leads to rounding errors
print(2.32781**55 + 10 == 2.32781**55)
The videos above explain this problem perfectly.
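To see why adding 10 to a huge float changes nothing, it can help to look at the gap between neighbouring floats. The sketch below uses `math.ulp` (available in Python 3.9 and later) to measure that gap; the variable names are only illustrative.

```python
import math

big = 2.32781**55

# math.ulp gives the distance from `big` to the next representable float.
gap = math.ulp(big)
print(f"big = {big:.6e}, gap to the next float = {gap}")

# 10 is much smaller than the gap, so the sum rounds back to `big`.
print(big + 10 == big)

# Adding a full gap does move to the next representable float.
print(big + gap == big)
```

Any change smaller than about half of that gap is simply rounded away.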
Number Problems
Problem 1: Integer Overflow
This is not an alternative to the videos above. Consider the following:
- You have 5 bits to represent a signed number (positive or negative): 0️⃣0️⃣0️⃣0️⃣0️⃣
- The computer stores this information as bits: 1's and 0's.
- One of the available bits is used to represent the sign (0️⃣ for positive, 1️⃣ for negative).
- This means you only have 4 bits left for the value itself, so you can only represent the numbers from (-15) through (0) to (15):
  - 1️⃣1️⃣1️⃣1️⃣1️⃣ (-15)
  - 0️⃣0️⃣0️⃣0️⃣0️⃣ (0)
  - 0️⃣1️⃣1️⃣1️⃣1️⃣ (15)
- So if we tried to add up (15 + 15), what would we get?
0️⃣1️⃣1️⃣1️⃣1️⃣
+ 0️⃣1️⃣1️⃣1️⃣1️⃣
------------
1️⃣1️⃣1️⃣1️⃣0️⃣ (which is -14)
And what if we try to add 15 + 1?
0️⃣1️⃣1️⃣1️⃣1️⃣
+ 0️⃣0️⃣0️⃣0️⃣1️⃣
------------
1️⃣0️⃣0️⃣0️⃣0️⃣ (which is -0; yes in computers, it's a thing)
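The two additions above can be replayed in code. This is a minimal sketch of the toy 5-bit scheme described here (one sign bit plus four value bits); the helper names `add_5bit` and `decode_sign_magnitude` are made up for illustration. Real hardware usually uses a different encoding (two's complement), but the wrap-around idea is the same.

```python
BITS = 5
SIGN_MASK = 1 << (BITS - 1)        # 0b10000, the sign bit
VALUE_MASK = SIGN_MASK - 1         # 0b01111, the four value bits
ALL_MASK = (1 << BITS) - 1         # 0b11111, everything that fits


def add_5bit(a, b):
    """Add two bit patterns and keep only the lowest 5 bits."""
    return (a + b) & ALL_MASK


def decode_sign_magnitude(pattern):
    """Read a 5-bit pattern as a sign bit followed by a 4-bit value."""
    value = pattern & VALUE_MASK
    return -value if pattern & SIGN_MASK else value


fifteen = 0b01111
one = 0b00001

result = add_5bit(fifteen, fifteen)
print(f"{result:05b}", decode_sign_magnitude(result))  # 11110 -14

result = add_5bit(fifteen, one)
# Prints 10000 0: Python integers have no separate negative zero,
# so the "-0" pattern decodes to plain 0 here.
print(f"{result:05b}", decode_sign_magnitude(result))
```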
A tradeoff was made when designing Python: prioritize ease of use over performance. The idea was that anyone who needed performance could go with a language like C. In fact, to this day C is still one of the most performant programming languages, and arguably one of the more complex.
How did this manifest in Python? You can't overflow integers (the problem described here can't happen).
This design decision is evident in how Python deals with numbers. In computers, values are generally stored in memory using a fixed number of bits. For example, on a 64-bit computer, integers are typically stored in 64 bits. One of those bits records whether the number is positive or negative, so a signed 64-bit integer can store numbers between $-2^{63}$ and $2^{63}-1$.
Traditionally, once a value grows past 2**63 - 1, it should overflow, and the calculation would yield a negative number.
# Most modern computers have 64-bit processors, which means they work natively with 64-bit integers.
large_number = 2**63 # already one past the largest signed 64-bit integer (2**63 - 1)
print(large_number)
# Making this very large number even bigger should mean overflowing the 64 bits.
very_large_number = large_number ** 4 # far too large to fit in 64 bits
print(very_large_number) # however, it doesn't overflow.
Python is convenient enough to allow for that to happen easily.
This convenience comes at the expense of performance and speed. Simple integer computation isn’t just about computing the result anymore: Python stops to check whether the result has been allocated enough bits, and, if not, it just adds more bits! So if you do math with an integer that won’t fit in 64 bits, Python will simply allocate more bits to the integer.
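You can watch this growth happen. The sketch below uses a small hypothetical helper, `describe_int`, that reports how many bits a value needs (`int.bit_length`) and how many bytes the Python object occupies (`sys.getsizeof`). The exact byte counts are an implementation detail of CPython and may differ between versions.

```python
import sys


def describe_int(n):
    """Return (bits needed for the value, bytes used by the int object)."""
    return n.bit_length(), sys.getsizeof(n)


for exponent in (10, 62, 64, 128, 1024):
    bits, size = describe_int(2**exponent)
    print(f"2**{exponent}: {bits} bits, {size} bytes in memory")
```

The bigger the number, the more memory Python sets aside for it, which is exactly the bookkeeping that costs time.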
This performance hit means vanilla Python isn’t a great option for data analysis, where speed is of the utmost importance and where we work with huge datasets.
This is why we use libraries like Numpy and Pandas. They make calculations a lot faster and a lot more efficient.
How fast? Let's try a toy example of adding numbers using vanilla Python and then again using Numpy.
We'll measure the speed of execution and the memory usage.
import numpy as np
import pandas as pd
import time # used to time the execution of the code
from pympler import asizeof # used to measure the size of the data structures
# Make a regular Python list
# with all the numbers up to one hundred million
# Remember `range` doesn't include the last number,
# so I have to go up to 100_000_001 to actually get all
# the numbers from 1 to 100_000_000
one_to_one_hund_mil_list = list(range(1, 100_000_001))
# Now make a numpy vector
one_to_one_hund_mil_vector = np.arange(1, 100_000_001)
start = time.time()
total = 0
# Add up the numbers one at a time with a plain Python loop
for i in one_to_one_hund_mil_list:
    total = total + i
end = time.time()
python_total = end - start
print(f"Python took {python_total:.3f} seconds")
start = time.time()
# Now we sum up all the numbers in the array
# using the numpy `sum` function.
np.sum(one_to_one_hund_mil_vector)
end = time.time()
numpy_total = end - start
print(f"Numpy took {numpy_total:.3f} seconds")
print(f"Numpy was {python_total / numpy_total:.1f}x faster!")
You could argue that a loop implementation isn't the most efficient. You're right, it's not; Numpy would still be drastically more efficient, though.
start = time.time()
sum(one_to_one_hund_mil_list)
end = time.time()
sum_python_total = end - start
print(f"Numpy was {sum_python_total / numpy_total:.1f}x faster than Python's built-in sum!")
Numpy wins even from a memory perspective:
# `asizeof.asizeof()` gets the size of an object
# and all of its contents in bytes, so we'll
# divide its output by one billion to get
# the value in gigabytes.
list_size_in_gb = asizeof.asizeof(one_to_one_hund_mil_list) / 1_000_000_000
vector_size_in_gb = asizeof.asizeof(one_to_one_hund_mil_vector) / 1_000_000_000
print(f"The Python list of numbers took up {list_size_in_gb:.2f} GB of RAM")
print(f"The numpy vector of numbers took up {vector_size_in_gb:.2f} GB of RAM")
print(
    f"That means the Python list took up {list_size_in_gb/vector_size_in_gb:.0f}x "
    "as much space as the numpy vector!"
)
Everything in the world of software design and architecture is a tradeoff. You don’t get efficiency without giving something up.
With Numpy and Pandas, you specify the type and size of the data you store, and they don’t check for integer overflows.
import numpy as np
# 63 bits hold the value; the 64th bit is the sign bit
a = np.array([2**63-2, 2**63-1], dtype='int64')
a
So if you try to increase the values by one:
a + 1 # adds 1 to every element; the last one overflows and becomes negative
x = np.array([2**15-1], dtype='int16') # 15 bits; 16th bit is the sign bit
x + 1 # another overflow (negative number)
# The bits can also underflow (wrap around) with unsigned integers
x = np.array([0], dtype='uint16')
x - 1 # underflow
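Since Numpy won't warn you about this wrap-around, one option is to check the limits yourself. The sketch below uses `np.iinfo` to look up the range of an integer type and shows that widening the array before the addition avoids the overflow. Widening is just one possible approach, and the variable names are illustrative.

```python
import numpy as np

# np.iinfo reports the smallest and largest value an integer dtype can hold.
info = np.iinfo(np.int16)
print(info.min, info.max)  # -32768 32767

x = np.array([info.max], dtype=np.int16)

wrapped = x + 1                    # stays int16 and silently wraps around
widened = x.astype(np.int64) + 1   # convert to a bigger type first, then add

print(wrapped[0], widened[0])      # -32768 32768
```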
Problem 2: Floating Number Precision
The other major problem you get with numbers is floating point precision: you get this problem whether you’re using Numpy or vanilla Python.
This happens because computers store all numbers, even decimal ones, using bits. I can't stress this enough: watch the video
and read this.
With a 64-bit floating point number, you get one bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa. The exponent decides where to place the decimal point, which results in a tradeoff between the size and the accuracy of the numbers you can store.
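To make that layout concrete, here is a small sketch that splits a Python float (a 64-bit float) into its three parts using the standard `struct` module. The helper name `float_bits` is hypothetical.

```python
import struct


def float_bits(value):
    """Return the sign, exponent, and mantissa bits of a 64-bit float as strings."""
    # Pack the float into 8 bytes, then reread those bytes as one unsigned integer.
    (as_int,) = struct.unpack(">Q", struct.pack(">d", value))
    bits = f"{as_int:064b}"
    return bits[0], bits[1:12], bits[12:]


sign, exponent, mantissa = float_bits(0.1)
print("sign:    ", sign)
print("exponent:", exponent, f"({len(exponent)} bits)")
print("mantissa:", mantissa, f"({len(mantissa)} bits)")
```

For 0.1 the mantissa is a repeating bit pattern that gets cut off after 52 bits, which is why the stored value is only an approximation.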
A manifestation of this issue can be demonstrated by this:
0.1 + 0.1 + 0.1 == 0.3 # False
As you’d learn from the video, numbers like these don't sit exactly on the computer's number line; they can only be approximated.
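A common way to live with this is to compare floats with a tolerance instead of `==`. The sketch below uses `math.isclose` from the standard library with its default tolerance, which is an assumption that suits this example rather than a universal setting.

```python
import math

total = 0.1 + 0.1 + 0.1

print(total == 0.3)              # False: the two values differ in the last bits
print(math.isclose(total, 0.3))  # True: equal within a small relative tolerance
print(f"{total:.20f}")           # shows the digits hiding behind the rounding
```

Numpy offers a similar element-wise check, `np.isclose`, for arrays.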
About Pandas
Pandas is the second package you’ll most commonly use in your analysis projects. It provides a flexible data structure to allow you to work with relational and labeled data sets.
It allows for:
- Easy searching and filtering of data
- Easy handling of missing data (we’ll talk more about that later in the class)
- Easy merging and joining of datasets
- Easy reshaping of data
- Easy handling of time series datasets, or data that spans a long period of time
The 2 most common data structures we’d use from pandas are:
- Series
- Data Frames
About Series
A Series is an ordered collection of values, generally of the same type. Kind of like a numpy array. (Actually, not just kind of: it is one.) Pandas uses numpy arrays to build its Series.
Series are central to pandas because pandas was designed for statistics, and Series are a perfect way to collect lots of different observations of a variable.
To illustrate, let me tell you about a week at the zoo I wish I owned. Here’s what attendance looked like at my zoo last week:
| Day of Week | Attendees |
|---|---|
| Monday | 132 People |
| Tuesday | 94 People |
| Wednesday | 112 People |
| Thursday | 84 People |
| Friday | 254 People |
| Saturday | 322 People |
| Sunday | 472 People |
To represent that using a Pandas Series:
import pandas as pd
attendance = pd.Series(
    [132, 94, 112, 84, 254, 322, 472],
    index=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
attendance
You can get the underlying numpy array through the values attribute.
attendance.values
This is good to know because every now and then you may find a tool that works with numpy arrays but not pandas. And when that happens, you now know how to pull out the numpy array underlying your Series and use it directly!
A Series also allows you to have named indices for its elements.
attendance.index
It’s also augmented with additional features to improve your development workflow.
You can also sort the days by the attendance
attendance = attendance.sort_values()
attendance
You could also use Pandas to subset datasets using indices, logical expressions and predicates
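As a rough illustration of those three ways to subset, here is a sketch that reuses the zoo attendance numbers from above. The threshold of 200 and the variable names are arbitrary choices for the example.

```python
import pandas as pd

attendance = pd.Series(
    [132, 94, 112, 84, 254, 322, 472],
    index=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

# By index label and by position
print(attendance.loc['Friday'])  # 254
print(attendance.iloc[0])        # 132

# By a logical expression: keep only the days with more than 200 attendees
busy_days = attendance[attendance > 200]
print(busy_days)

# By a predicate on the index: keep only the weekend
weekend = attendance[attendance.index.isin(['Saturday', 'Sunday'])]
print(weekend)
```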
About DataFrames
A pandas data frame is a tabular data structure, like a 2d array. You can think of it as a Series generalized to two dimensions: it can do nearly everything you’d do with a Series.
The biggest difference here is that we have two indices: one for rows, and another for columns.
df = pd.DataFrame({'animals': ['dog', 'cat', 'bird', 'fish'],
                   'can_swim': [True, False, False, True],
                   'has_fur': [True, True, False, False]})
df
| | animals | can_swim | has_fur |
|---|---|---|---|
| 0 | dog | True | True |
| 1 | cat | False | True |
| 2 | bird | False | False |
| 3 | fish | True | False |
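To see the two indices at work, the sketch below rebuilds the same data frame and looks values up by row label and column name. The variable `swimmers` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'animals': ['dog', 'cat', 'bird', 'fish'],
                   'can_swim': [True, False, False, True],
                   'has_fur': [True, True, False, False]})

print(df.index)    # the row index (0 to 3)
print(df.columns)  # the column index (animals, can_swim, has_fur)

# Pick one cell by row label and column name
print(df.loc[1, 'animals'])  # cat

# Filter rows with a boolean column, then pick one column
swimmers = df.loc[df['can_swim'], 'animals']
print(list(swimmers))  # ['dog', 'fish']
```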
We can also construct data frames from other sources (CSV files, Excel spreadsheets, databases, ...).
We'll see that in the next post.