In short, the main task is to clean The Metropolitan Museum of Art Open Access dataset.
The instructions are not given in detail: it is up to you to come up with ideas on how to fulfill the particular tasks as well as possible!
However, we strongly recommend and require the following:
- Follow the assignment step by step. Number each step.
- Most steps state the minimum number of features that should be treated. You can preprocess more features, but this does not mean the teacher will give you more points. Focus on quality, not quantity.
- Properly comment on all your steps. Use Markdown cells and visualizations. Comments are evaluated for 2 points of the total, together with the final presentation of the solution. However, it is not desirable to write novels!
- This task is time-consuming and computationally intensive. Do not leave it to the last minute.
- Hand in a notebook that has already been run (i.e., do not delete outputs before handing in).
- Download the dataset MetObjects.csv from the repository https://github.com/metmuseum/openaccess/.
- Check consistency (i.e., that the same things are represented in the same way) of at least 3 features where you expect problems (including the "Object Name" feature). You can propose how to clean the selected features. However, do not apply cleaning (in your interest). (1.5 points)
- Select at least 2 features (i.e., 1 couple) where you expect integrity problems (describe your choice) and check the integrity of those features. By integrity, we mean correct logical relations between features (e.g., female names for females only). (2 points)
- Convert at least 5 features to a proper data type. Choose at least 1 numeric, 1 categorical (i.e., ordinal or nominal), and 1 datetime. (1.5 points)
- Find some outliers and describe your method. (3 points, depends on creativity)
- Detect missing data in at least 3 features, convert them to a proper representation (if they are not already), and impute missing values in at least 1 feature using some imputation method (note that imputation by mean or median is too trivial to obtain any points). (2 + 3 points, depends on creativity)
- Focus in more detail on cleaning the "Medium" feature, as if you were going to use it in the KNN classification algorithm later. (3 points)
- Focus on the extraction of the physical dimensions of each item (width, depth, and height in centimeters) from the "Dimensions" feature. (2 points)
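
As a starting point for the "Dimensions" step, the sketch below pulls the numbers out of the first parenthesised centimetre group in a string. It assumes that many values look like `30 7/8 x 25 x 13 7/8 in. (78.4 x 63.5 x 35.2 cm)`; this format is an assumption to verify against the real data, and the helper name `extract_cm` is hypothetical. It deliberately does not decide which number is the width, depth, or height, because the order of the values has to be checked per record.

```python
import re

# A decimal number such as "78.4" or "25".
NUMBER = r"\d+(?:\.\d+)?"

# A parenthesised metric group such as "(78.4 x 63.5 x 35.2 cm)".
# Assumption: centimetre values are given in parentheses and separated by "x" or "×".
CM_GROUP = re.compile(
    rf"\(\s*({NUMBER}(?:\s*[x×]\s*{NUMBER})*)\s*cm\s*\)",
    re.IGNORECASE,
)


def extract_cm(dimensions):
    """Return the numbers of the first '(... cm)' group as floats (in cm).

    Returns an empty list for missing values or strings without such a group.
    """
    if not isinstance(dimensions, str):
        return []
    match = CM_GROUP.search(dimensions)
    if match is None:
        return []
    parts = re.split(r"\s*[x×]\s*", match.group(1), flags=re.IGNORECASE)
    return [float(part) for part in parts]


print(extract_cm("30 7/8 x 25 x 13 7/8 in. (78.4 x 63.5 x 35.2 cm)"))
```

Records with several groups (for example a height and a diameter given separately), with other units, or with labels such as "Overall:" need extra rules. Counting how many rows return an empty list is a simple way to see how much of the feature such a rule leaves uncovered.
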
All your steps, your choices of methods, and the code that follows from them must be commented on! For text comments (discussion, etc., not code comments), use Markdown cells. Comments are evaluated for 2 points together with the final presentation of the solution.
If you do all this properly, you will obtain 20 points.
- Please follow the technical instructions from https://courses.fit.cvut.cz/NI-PDD/homeworks/index.html.
- Methods that are more complex and were not shown during the tutorials are considered more creative and should be described in detail.
- English is not compulsory.
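
To illustrate what preparing the "Medium" feature for KNN could mean, the sketch below splits a free-text value into normalised material tokens and turns them into a multi-hot vector, so that a distance between two items can be computed. The separators used for splitting, the example values, and the helper names `medium_tokens` and `multi_hot` are assumptions made for illustration, not a prescribed solution.

```python
import re

# Assumed separators between materials: punctuation and a few linking words.
SEPARATORS = re.compile(r"\s*(?:[,;/]|\band\b|\bwith\b|\bon\b)\s*")


def medium_tokens(medium):
    """Split a 'Medium' string into lower-cased, de-duplicated material tokens."""
    if not isinstance(medium, str):
        return []
    text = medium.strip().lower()
    text = re.sub(r"\([^)]*\)", " ", text)  # drop parenthesised remarks
    tokens = []
    for part in SEPARATORS.split(text):
        part = re.sub(r"\s+", " ", part).strip(" .")
        if part and part not in tokens:
            tokens.append(part)
    return tokens


def multi_hot(rows, vocabulary):
    """Encode each token list as a 0/1 vector over the given vocabulary."""
    return [[1 if term in tokens else 0 for term in vocabulary] for tokens in rows]


rows = [medium_tokens("Oil on canvas"), medium_tokens("Silver, gold and enamel")]
vocabulary = sorted({token for tokens in rows for token in tokens})
print(vocabulary)
print(multi_hot(rows, vocabulary))
```

With such vectors, a set-based distance (for example Jaccard or Hamming) is one reasonable choice for KNN. Rare tokens may need to be merged or dropped to keep the vocabulary manageable.
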