1. INTRODUCTION
I participated in openHPI’s Data Science Bootcamp 2023, where we were given a dataset titled ToyotaCorolla to practice multivariate analysis (MVA) in the lecture ‘EDA and Statistical Analysis’. The aim was to predict the price of a used car from a set of given variables (features such as km, hp, gears, etc.). Later on I decided to practice more and use this dataset to answer some other questions.
The instructors were not sure whether the dataset was real or a dummy set, and they could not provide its source. So before starting my analysis, I wanted to make sure I was using the original dataset, if one exists.
1.1 Why did I bother to spend time just to find the source?
Well, I believe that every data analysis should start with questioning the data itself. In real life we have to ‘trust’ a lot of things and may not have the time for such an approach, but I wanted to make this a habit while I am still at entry level in data science.
1.2 The Source
Unfortunately I couldn’t reach the ‘official’ source of the data, but I found where it (probably) first appeared. The information below will also give a hint about the source and the owner of this particular dataset.
1.3 Is it a real-life data sample?
After doing a comprehensive EDA on this dataset and gathering the information listed below, I am convinced that the data is most probably real. In my EDA the data behaved consistently, and the pieces of information below do not contradict each other.
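One way to picture such a consistency check: the dataset stores a car's age in months (Age_08_04) next to its manufacturing month and year (Mfg_Month, Mfg_Year), so the three columns can be compared against each other. The snippet below is a minimal sketch under stated assumptions, not a definitive test. The function names and the sample rows are made up for illustration, and whether the reference month itself is counted is an assumption to verify against the real file.

```python
from collections import Counter


def implied_age_months(mfg_year, mfg_month, ref_year=2004, ref_month=8):
    """Months between the manufacturing date and the reference month.

    Assumption: the reference is August 2004 and the reference month
    itself is not counted.
    """
    return (ref_year - mfg_year) * 12 + (ref_month - mfg_month)


def age_offsets(rows):
    """Count each (recorded age - implied age) difference across rows."""
    offsets = Counter()
    for row in rows:
        implied = implied_age_months(int(row["Mfg_Year"]), int(row["Mfg_Month"]))
        offsets[int(row["Age_08_04"]) - implied] += 1
    return offsets


# Made-up rows for illustration only, not values from the real file.
sample_rows = [
    {"Age_08_04": "24", "Mfg_Year": "2002", "Mfg_Month": "8"},
    {"Age_08_04": "13", "Mfg_Year": "2003", "Mfg_Month": "8"},
]
print(age_offsets(sample_rows))
```

If nearly all rows share a single offset, the three columns agree with each other up to a counting convention. A wide spread of offsets would point to inconsistent records.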
1.4 Disclaimer
I do not own the Corolla dataset that is the focus of this post. This research is for educational purposes only; I did it just to learn more about the source of the dataset and how it has been used before. The information below is taken from the websites mentioned, all of which are publicly available.
2. WHAT IS THE COROLLA DATASET ABOUT?
I’ll start with the only description available online. This is what you get when you type the name of this dataset in your search bar:
“The dataset ToyotaCorolla.xls contains data on used cars on sale during the late summer of 2004 in the Netherlands. It has 1436 records containing details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications.”
After coming across some students’ posts about this dataset, I have a sense that it was created by a dealership that offers to buy Toyota owners’ used cars as part of a trade-in. It sounds like the dealers themselves collected data on their previous sales and used it to come up with the best offer for those clients.
I say ‘students’ because those pages showed assignments with course codes on them. This gave me the idea that there might be some textbooks using this dataset.
2.1 The Books
I found three books mentioning this Toyota Corolla dataset, but there are probably more. The oldest among them is the following book, which dates back to 2010:
Shmueli, Patel and Bruce (2010) Data Mining for Business Intelligence. Second edition. John Wiley & Sons, Inc., Hoboken, NJ.
The second one dates back to 2013:
Ledolter (2013) Data Mining and Business Analytics with R. John Wiley & Sons, Inc., Hoboken, NJ.
The third one will be shared below.
2.2 The Authors (of the books above, not the data itself)
Galit Shmueli mentions that the book is used for ‘Business Analytics using Data Mining’, a postgraduate elective course at ISB. But I couldn’t find any data source on her website.
I could not find a personal website of Peter Bruce.
I could not find a personal website of Nitin Patel either. However, I found a page for his course ‘Data Mining’ on MIT OpenCourseWare. Unfortunately, the course dates back to 2003, a year before the period the dataset covers.
Johannes Ledolter did provide some information on his university web page, including a csv file. But it contains only 10 columns: Price, Age, Km, Fuel Type, HP, Met Color, Automatic, CC, Doors and Weight. It looks like he limited the course exercises to those variables.
2.3 Other Courses
While searching for the related courses, I’ve come across some others too.
Roger Bohn from UC San Diego mentions that he taught a course titled ‘Big Data Analytics’ using the textbook below (the third one that I mentioned above):
Shmueli, Bruce, Yahav, Patel and Lichtendahl (2017) Data Mining for Business Analytics in R. John Wiley & Sons, Inc.
The university has two web pages for the mentioned courses that look identical, and both offer the full version of the Corolla dataset (in xlsx format).*
*In that file, R. Bohn added one extra column for his own study, which can easily be omitted.
I’d also like to mention that Hedibert Freitas Lopes was the only person I came across who cited the first two books listed above together.
As you may have already noticed, all these books are from the same publisher. They probably own the dataset. But there seems to be no information on who collected the data or how. The mentioned dealer could be Bovag, simply because (as you’ll see below) their name is in the dataset. But I couldn’t find any information about this dataset on their website either.
3. WHAT IS IN THE COROLLA DATASET?
Below is a list of all the columns in this dataset. The descriptions of the variables are taken from the xlsx file provided by Roger Bohn that I mentioned above.
No. | Column | Description |
0 | Id | Record Id |
1 | Model | Model Description |
2 | Price | Offer price in euros |
3 | Age_08_04 | Age in months as in August 2004 |
4 | Mfg_Month | Manufacturing month (1-12) |
5 | Mfg_Year | Manufacturing year |
6 | KM | Accumulated kilometers on odometer |
7 | Fuel_Type | Fuel Type (Petrol, Diesel, CNG) |
8 | HP | Horse Power |
9 | Met_Color | Metallic Color (Yes=1, No=0) |
10 | Color | Color (Blue, Red, Grey, Silver, Black, etc.) |
11 | Automatic | Automatic (Yes=1, No=0) |
12 | CC | Cylinder Volume in cubic centimeters |
13 | Doors | Number of doors |
14 | Cylinders | Number of cylinders |
15 | Gears | Number of gear positions |
16 | Quarterly_Tax | Quarterly road tax in euros |
17 | Weight | Weight in kilograms |
18 | Mfr_Guarantee | Within Manufacturer’s Guarantee period (Yes=1, No=0) |
19 | BOVAG_Guarantee | BOVAG (Dutch dealer network) Guarantee (Yes=1, No=0) |
20 | Guarantee_Period | Guarantee period in months |
21 | ABS | Anti-Lock Brake System (Yes=1, No=0) |
22 | Airbag_1 | Driver Airbag (Yes=1, No=0) |
23 | Airbag_2 | Passenger Airbag (Yes=1, No=0) |
24 | Airco | Air Conditioning (Yes=1, No=0) |
25 | Automatic_airco | Automatic Air Conditioning (Yes=1, No=0) |
26 | Boardcomputer | Board computer (Yes=1, No=0) |
27 | CD_Player | CD Player (Yes=1, No=0) |
28 | Central_Lock | Central Lock (Yes=1, No=0) |
29 | Powered_Windows | Powered Windows (Yes=1, No=0) |
30 | Power_Steering | Power Steering (Yes=1, No=0) |
31 | Radio | Radio (Yes=1, No=0) |
32 | Mistlamps | Mist lamps (Yes=1, No=0) |
33 | Sport_Model | Sport Model (Yes=1, No=0) |
34 | Backseat_Divider | Backseat Divider (Yes=1, No=0) |
35 | Metallic_Rim | Metallic Rim (Yes=1, No=0) |
36 | Radio_cassette | Radio Cassette (Yes=1, No=0) |
37 | Parking_assistant | Parking assistance system (Yes=1, No=0) |
38 | Tow_Bar | Tow Bar (Yes=1, No=0) |
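Many of the columns above are described as Yes=1, No=0 flags, which makes them easy to validate. The snippet below is a minimal sketch of such a check, assuming the rows are available as dictionaries keyed by column name (for example from Python's csv.DictReader). The function name and the sample rows are hypothetical and not taken from the real file.

```python
def non_binary_values(rows, flag_columns):
    """Map each flag column to the set of values other than 0/1 found in it."""
    problems = {}
    for row in rows:
        for column in flag_columns:
            value = str(row[column]).strip()
            if value not in {"0", "1"}:
                problems.setdefault(column, set()).add(value)
    return problems


# Made-up rows for illustration only, not values from the real file.
sample_rows = [
    {"ABS": "1", "Airco": "0"},
    {"ABS": "yes", "Airco": "1"},
]
print(non_binary_values(sample_rows, ["ABS", "Airco"]))
```

An empty result means every listed flag column contains only 0 and 1, which is what the descriptions above lead you to expect.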
There are 1436 rows in total, excluding the header. The Id column starts at 1 and ends at 1442, so 6 Ids are missing. None of the files that I found had those 6 rows. It seems that the authors dropped them at the very beginning.
3.1 CSV Files on the Internet
OK, not the whole web, but I checked GitHub and Kaggle too.
A full version of this file can be found on GitHub in a collection of sample datasets prepared by Reisanar. That file dates back to 2018 and was updated in 2020.
On Kaggle, the most upvoted version (at the time of this research) is missing two columns (Color and Parking Assistant). It was updated 6 years ago.
The Corolla dataset that I used for my analysis is the full version, identical to the one on UCSD’s course page. It was updated 3 years ago.
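The gap between the highest Id and the row count can be checked with a few lines of code. This is a minimal sketch with a hypothetical helper name and a tiny made-up Id list; to use it on the real file you would pass in the values of the Id column.

```python
def missing_ids(ids, first=1, last=None):
    """Return the sorted Ids in [first, last] that do not occur in `ids`."""
    present = set(ids)
    if last is None:
        last = max(present)
    return [i for i in range(first, last + 1) if i not in present]


# Made-up Ids for illustration only, not the real Id column.
example_ids = [1, 2, 3, 5, 6, 9]
print(missing_ids(example_ids))  # [4, 7, 8]
```

The same arithmetic applies to the numbers above: Ids running from 1 to 1442 with only 1436 rows leave 1442 - 1436 = 6 Ids unaccounted for.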
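Spotting missing columns like these comes down to comparing the header rows of two files. The snippet below is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the header lists in the example are shortened stand-ins rather than the full headers of the actual files.

```python
import csv


def read_header(path):
    """Return the column names from the first row of a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle))


def compare_columns(reference, candidate):
    """Return (missing, extra) column names of `candidate` versus `reference`."""
    reference_set, candidate_set = set(reference), set(candidate)
    missing = [name for name in reference if name not in candidate_set]
    extra = [name for name in candidate if name not in reference_set]
    return missing, extra


# Shortened, illustrative headers; the real files have many more columns.
full_header = ["Id", "Model", "Price", "Color", "Parking_assistant"]
other_header = ["Id", "Model", "Price"]
print(compare_columns(full_header, other_header))
```

With real files, the two lists would come from read_header called on each file path. The same comparison also reveals an extra column added to one of the copies.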
3.2 Why did I not create a new dataset with all this information?
I don’t want to create yet another duplicate file. And as you’ve (hopefully) read and seen, none of these sources has information about the original source and its methods.
The dealer is probably Bovag, but even then we don’t know the license information. Again, all of this is for educational purposes only. I just wanted to be sure that this dataset is consistent and can be used for practice. That’s all.
FINAL WORDS
I hope this information helps you in some way. Please feel free to check my analysis and share some feedback too. I’d appreciate that!