Calculating cannibalization of organic search visits when SEM advertising is purchased
- SEM = Paid Search = sponsored ads within the same search results
- SEO = Organic Search = non-paid listings presented in the search results
For every visit generated through SEM advertising, what is the impact on SEO visits?
Organic Search drives a substantial share of visits to a company's website, on the order of 20% - 40% of total traffic.
It is believed that when a company purchases SEM advertising, Organic Search visits may decrease, because users who would normally navigate to the company's website via the natural listing use the paid listing instead.
Nugit wishes to better understand this relationship between SEM and SEO results using regression analysis.
To get the files, clone the repository:
$ git clone [email protected]:nugit/datascientist-task.git
- main.py: Python file for you to get started; it is the main executable file.
- requirements.txt: file for you to list any 3rd-party modules.
- data/sample_data_Oct.csv: sample CSV data file. Each CSV file contains the daily number of visits for SEO (organic/non-paid) and SEM (paid) in 2014. If you prefer to work with JSON, you may use my Csv-Json switcher Python file on GitHub: cjswitch
- data/sampleoutput.json: sample output JSON file.
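As a starting point, the daily rows could be read and trimmed to the most recent weeks roughly as follows. This is only a sketch: the column names `date`, `seo` and `sem` are assumptions (check the header row of the real file), the helper names are illustrative, and the rows are assumed to be in ascending date order with one row per day.

```python
import csv
import io

# Hypothetical column names; adjust to match the header of the real CSV.
DATE_COL, SEO_COL, SEM_COL = "date", "seo", "sem"


def read_visits(csv_file):
    """Return a list of (date, seo_visits, sem_visits) tuples from an open CSV file."""
    reader = csv.DictReader(csv_file)
    return [
        (row[DATE_COL], float(row[SEO_COL]), float(row[SEM_COL]))
        for row in reader
    ]


def last_n_weeks(rows, weeks):
    """Keep the most recent weeks * 7 daily rows (rows assumed in date order)."""
    if weeks < 1:
        raise ValueError("weeks must be at least 1")
    return rows[-weeks * 7:]


# Tiny inline sample standing in for data/sample_data_Oct.csv.
sample = io.StringIO(
    "date,seo,sem\n"
    "2014-10-01,2100,900\n"
    "2014-10-02,2050,1100\n"
)
rows = read_visits(sample)
print(last_n_weeks(rows, 26))
```

With the real file, `open("data/sample_data_Oct.csv", newline="")` would replace the inline sample.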
- Points marked with [Program] are to be completed in Python. Outputs are in JSON.
- Points marked with [Question] are optional and are for you to showcase your statistical/machine learning knowledge. Please keep your answers short, in dot points.
- Feel free to create more Python files as necessary. Just make sure that main.py is the only file that gets executed.
- Feel free to use any Python module(s) as necessary. Just remember to add them to requirements.txt.
- Ensure that your .py files follow the PEP 8 coding style guide.
- main.py + other Python files - for you to show off your logic
- requirements.txt - to list any 3rd-party modules
- output.json - your JSON results. Please see data/sampleoutput.json for an example submission.
- submission.md or submission.html or submission.txt or submission.pdf - for you to write your answers/comments/suggestions. We recommend using the online notebook wakari.
Please do not write your answers in a Word document.
Submit your completed task to [email protected] by providing a link to a private bitbucket or github repository or somewhere online to view the files. Feel free to email me any questions.
Using the last 26 weeks of data in data/sample_data_Oct.csv:
- [Program] Fit the data to a regression function of the form y = mx + b
- [Program] Using the function, calculate the impact (the SEO value) when SEM has the highest number of visits
- [Program] Using the function, calculate the impact (the SEO value) when SEM has the median number of visits
- [Question] Are there other statistical methods that can show the impact of SEM visits on SEO visits?
y = mx + b
where:
- y is the dependent variable, SEO
- x is the independent variable, SEM
- m is the gradient
- b is the y-intercept
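One way to obtain m and b is an ordinary least squares fit. The sketch below uses only the standard library; the function names `fit_line` and `predict` and the sample values are illustrative and not part of the task files.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = m*x + b; returns (m, b)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has zero variance, so the gradient is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def predict(m, b, x_value):
    """Evaluate the fitted line at a given SEM value."""
    return m * x_value + b


# Made-up values that lie exactly on y = -2x + 11.
sem = [1.0, 2.0, 3.0, 4.0]
seo = [9.0, 7.0, 5.0, 3.0]
m, b = fit_line(sem, seo)
print(m, b, predict(m, b, max(sem)))
```

The "impact" at the highest or median SEM value would then be `predict(m, b, max(sem))` or `predict(m, b, statistics.median(sem))`.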
Round gradient and yintercept to 2 decimal places. maxSEM and maxSEMimpact are integer numbers of visits.
{
"filename": "sample_data_Oct.csv",
"datarange": "12weeks",
"gradient": 0.02,
"yintercept": 2004.45,
"maxSEM": 3000,
"maxSEMimpact": 1000,
"medianSEM": 1250,
"medianSEMimpact": 2200
}
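A result in this shape could be assembled along the following lines. This is a sketch under assumptions: `build_output` is a hypothetical helper, the coefficients and SEM values are made up, the impacts are computed from the unrounded coefficients, and the median fields are rounded to integers in the same way as the max fields.

```python
import json
import statistics


def build_output(filename, datarange, gradient, yintercept, sem_visits):
    """Assemble the result dictionary with the rounding described above."""
    max_sem = max(sem_visits)
    median_sem = statistics.median(sem_visits)
    return {
        "filename": filename,
        "datarange": datarange,
        "gradient": round(gradient, 2),
        "yintercept": round(yintercept, 2),
        "maxSEM": int(round(max_sem)),
        "maxSEMimpact": int(round(gradient * max_sem + yintercept)),
        "medianSEM": int(round(median_sem)),
        "medianSEMimpact": int(round(gradient * median_sem + yintercept)),
    }


# Made-up coefficients and SEM visits, for illustration only.
result = build_output("sample_data_Oct.csv", "26weeks", 0.5, 100.0, [10, 20, 30])
print(json.dumps(result, indent=2))
```

Writing the file is then a matter of `json.dump(result, f, indent=2)` on a file opened for writing.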
(B) Model Validation
- [Program] Calculate the Correlation Coefficient and the Coefficient of Determination
- [Question] What do the coefficients tell you about the regression function and the data?
- [Question] Determining how well the data fits into your regression function can be done by calculating the correlation coefficient. However, it is also known that this is not a good measure of model validation. What other approaches could you use? Feel free to program this if you wish.
{
"r": 0.922,
"rsquared": 0.850
}
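The two coefficients could be computed as in the sketch below, using only the standard library. For a simple linear regression with an intercept, the coefficient of determination equals the square of Pearson's r, which is what this sketch relies on. The function name `pearson_r` and the sample values are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


# Made-up SEM and SEO values, for illustration only.
sem = [1.0, 2.0, 3.0, 4.0]
seo = [2.0, 4.0, 5.0, 9.0]
r = pearson_r(sem, seo)
r_squared = r ** 2  # valid for a one-variable linear fit with an intercept
print({"r": round(r, 3), "rsquared": round(r_squared, 3)})
```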
- [Program] Perform the same analysis as in (A) and (B), but over a 12-week period
- Provide tests to accompany your Python functions. At Nugit, we use unittest and nose with code coverage.
- Provide an approach to remove outliers. Feel free to program this.
- You will notice that there is a difference in results by using 3 and 6 months of data for trend estimation. How would you go about de-trending the data to produce a more accurate picture of the relationship?
- Chart your results using any JavaScript library.
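For the outlier and de-trending points above, one possible approach (among several) is sketched here: an interquartile-range filter for outliers, and first differences as a simple way to remove a slowly moving trend before regressing. The function names, the 1.5 multiplier and the sample values are assumptions made for illustration; `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers_iqr(pairs, k=1.5):
    """Drop (sem, seo) pairs where either value falls outside its IQR fences."""
    sem_low, sem_high = iqr_bounds([sem for sem, _ in pairs], k)
    seo_low, seo_high = iqr_bounds([seo for _, seo in pairs], k)
    return [
        (sem, seo)
        for sem, seo in pairs
        if sem_low <= sem <= sem_high and seo_low <= seo <= seo_high
    ]


def first_differences(values):
    """Day-over-day changes; differencing removes a linear trend in the level."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Made-up (sem, seo) pairs; the last SEO value is an obvious outlier.
pairs = [(10, 100), (11, 101), (12, 102), (13, 103),
         (14, 104), (15, 105), (16, 106), (17, 1000)]
clean = remove_outliers_iqr(pairs)
print(len(pairs), len(clean))
print(first_differences([seo for _, seo in clean]))
```

Regressing the differenced SEO series on the differenced SEM series then estimates how day-to-day changes move together, rather than how two trending levels happen to line up.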