TikTok Video Claims Classification Model

Project Type: Machine Learning

Coding Language: Python

Packages Used: Random Forest, XGBoost, NLP (natural language processing)

In this project, from Google's Advanced Data Analytics certificate, I act as a data professional at TikTok. My supervisor was impressed with the work I have done and has asked me to build a machine learning model that determines whether a video contains a claim or offers an opinion. With a successful prediction model, TikTok can reduce the backlog of user reports and prioritize them more efficiently.

Classifying videos using machine learning

In this project, I use machine learning techniques to predict a binary outcome variable.

The purpose of this model is to shorten response time and improve system efficiency by automating the initial stages of the claims process.

The goal of this model is to predict whether a TikTok video presents a "claim" or presents an "opinion".

This project has three parts:

Part 1: Ethical considerations

In this scenario, it's preferable for the model to generate false positives rather than false negatives. Identifying videos that violate terms of service is crucial, even if it means some opinion videos are mistakenly labeled as claims. The most severe consequence of misclassifying an opinion as a claim is that the video undergoes human review. However, misclassifying a claim as an opinion could lead to the video not being reviewed, potentially resulting in a terms of service violation. According to the data dictionary, a video violating terms of service is attributed to a "banned" author.
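
To make the trade-off concrete, here is a minimal sketch that counts the four possible outcomes on a handful of invented labels and computes recall and precision. The labels and the helper names (confusion_counts, recall, precision) are illustrative assumptions, not code from the project notebook. Because a missed claim (a false negative) is the costlier error in this scenario, recall on the "claim" class is the number to watch.

```python
def confusion_counts(y_true, y_pred, positive="claim"):
    """Count true/false positives and negatives, treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # claim correctly flagged
        elif predicted == positive:
            fp += 1  # opinion sent to human review unnecessarily
        elif actual == positive:
            fn += 1  # claim missed: the costly error in this scenario
        else:
            tn += 1  # opinion correctly left alone
    return tp, fp, fn, tn


def recall(tp, fn):
    """Share of actual claims that the model caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Share of predicted claims that really were claims."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy labels, invented for illustration only (not project data).
y_true = ["claim", "claim", "claim", "opinion", "opinion", "opinion"]
y_pred = ["claim", "claim", "opinion", "claim", "opinion", "opinion"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"recall={recall(tp, fn):.2f} precision={precision(tp, fp):.2f}")
```

A model tuned for this use case would accept a somewhat higher FP count in exchange for a lower FN count.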

Part 2: Feature engineering

Perform feature selection, extraction, and transformation to prepare the data for modeling.
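
As a rough sketch of what these three steps can look like with pandas, the example below derives a text-length feature (extraction), one-hot encodes a categorical column (transformation), and drops the raw text (selection). The column names video_transcription_text, author_ban_status, and video_view_count, as well as the rows themselves, are assumptions made for illustration; the actual notebook may use different names and steps.

```python
import pandas as pd

# Toy rows invented for illustration; column names are assumptions.
df = pd.DataFrame({
    "video_transcription_text": [
        "someone shared that drones were invented in 1900",
        "i think cats are better than dogs",
        "a colleague read that the moon is hollow",
    ],
    "author_ban_status": ["banned", "active", "under review"],
    "video_view_count": [343296, 1200, 902185],
})

# Extraction: derive a numeric feature from the raw text.
df["text_length"] = df["video_transcription_text"].str.len()

# Selection: drop the raw text once the derived feature exists.
features = df.drop(columns=["video_transcription_text"])

# Transformation: one-hot encode the categorical column.
# drop_first=True removes one redundant dummy column per category set.
X = pd.get_dummies(features, columns=["author_ban_status"], drop_first=True)

print(X.columns.tolist())
```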

Part 3: Modeling

Task Overview:

TikTok needs a machine learning model to distinguish between claims and opinions in reported videos due to the sheer volume of reports. Claims, rather than opinions, are more likely to violate terms of service. By predicting this distinction, the model can streamline human moderation efforts.

Modeling Approach:

Using the 'claim_status' column from the data dictionary as the target variable, the model aims to classify each video as either a claim or an opinion.
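
A minimal sketch of preparing that target, assuming the column holds the two string labels "claim" and "opinion" (the toy rows below are invented): map the labels to integers, with "claim" as the positive class, and fail loudly if an unexpected label appears.

```python
import pandas as pd

# Toy target column; the values are assumed to be "claim" or "opinion".
df = pd.DataFrame({"claim_status": ["claim", "opinion", "claim", "opinion"]})

# Encode the target: claim is the positive class (1), opinion the negative (0).
y = df["claim_status"].map({"opinion": 0, "claim": 1})

# Labels missing from the mapping become NaN, so check before modeling.
if y.isna().any():
    raise ValueError("Unexpected label in claim_status")

y = y.astype(int)

# Class balance matters when choosing an evaluation metric.
print(y.value_counts(normalize=True))
```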

Throughout this project, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

Please view the full code document linked at the top of the page to see the entire process.

The project determines that the Random Forest model is the champion model and is explored below in more detail.
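
One common way to pick a champion is to cross-validate each candidate on the training data and compare a single metric. The sketch below does this with recall, on synthetic data from make_classification. It uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost so that the example needs only one library. The data, the hyperparameters, and the names candidates and champion_name are assumptions for illustration and do not reproduce the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data (not the TikTok dataset).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

# Hold out a test set that plays no part in model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Stand-in for XGBoost in this sketch.
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated recall for each candidate.
cv_recall = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="recall").mean()
    for name, model in candidates.items()
}

# The champion is the candidate with the highest cross-validated recall.
champion_name = max(cv_recall, key=cv_recall.get)
champion = candidates[champion_name].fit(X_train, y_train)

print(cv_recall)
print("champion:", champion_name)
```

The held-out test set is then used once, on the champion only, to estimate how it will perform on unseen data.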

The most predictive features are all related to the engagement levels generated by the video. This is not unexpected, as the earlier exploratory data analysis (EDA) pointed to the same conclusion.
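
For readers who want to see how such a ranking is typically obtained, here is a small sketch using the feature_importances_ attribute of scikit-learn's RandomForestClassifier. The data is synthetic and deliberately constructed so that the engagement columns separate the classes, purely so the example has something to find. The column names and all values are assumptions, not project data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)  # 1 = claim, 0 = opinion (synthetic labels)

# Synthetic features: engagement is inflated for claims by construction,
# while duration is pure noise. Column names are assumptions.
X = pd.DataFrame({
    "video_view_count": rng.poisson(1000, n) + y * rng.poisson(50000, n),
    "video_like_count": rng.poisson(100, n) + y * rng.poisson(8000, n),
    "video_duration_sec": rng.integers(5, 60, n),
})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(
    rf.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances)
print("test recall:", recall_score(y_test, rf.predict(X_test)))
```

Impurity-based importances can overstate high-cardinality features, so permutation importance on held-out data is a reasonable cross-check.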

Conclusion:

1. Would you recommend using this model? Why or why not?

Yes, the model is recommended as it demonstrated strong performance on both validation and test data, with consistently high precision and F1 scores. It effectively classified claims and opinions.

2. What was your model doing? Can you explain how it was making predictions?

The model primarily utilized features related to user engagement levels (views, likes, shares, and downloads) associated with each video to make predictions.

3. Are there new features that you can engineer that might improve model performance?

Given the current high performance of the model, there's no immediate need for new feature engineering.

4. What features would you want to have that would likely improve the performance of your model?

While the current model doesn't require additional features, including variables such as the number of times a video was reported and the total user reports for each author could potentially enhance its performance.