
User:Moonstar0619/sandbox

From Wikipedia, the free encyclopedia

Privacy+ Topic: Contact scraping (5th draft)[edit]

For broader coverage of this topic, see Web scraping and Data scraping.

Contact scraping is the practice of obtaining access to a customer's e-mail account in order to retrieve contact information that is then used for marketing purposes.

The New York Times refers to the practices of Tagged, MyLife and desktopdating.net as "contact scraping". [1]

Several commercial packages are available that implement contact scraping for their customers including ViralInviter, TrafficXplode, and TheTsunamiEffect. [2]

(The above is the original Wikipedia article of Contact scraping)


Contact scraping is one of the applications of web scraping, and examples of email scraping tools include UiPath, Import.io, and Screen Scraper. [3] Alternative web scraping tools include UzunExt, R functions, and Python's Beautiful Soup. The legal issues of contact scraping fall under the legality of web scraping.

Web Scraping Tools[edit]

The following web scraping tools can be used as alternatives for contact scraping:

  1. UzunExt is a data scraping approach in which string methods and a crawling process are used to extract information without building a DOM tree. [4]
  2. The R functions data.rm() and data.rm.a() can be used as a web scraping strategy. [5]
  3. The Python Beautiful Soup library can be used to scrape data and convert it into CSV files. [6]
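As a concrete illustration of the scrape-and-export workflow described in item 3, the sketch below pulls contact e-mail addresses out of an HTML fragment and serializes them to CSV. This is a minimal sketch using only Python's standard library; the HTML snippet and column names are hypothetical, and a real project would typically fetch a live page and parse it with Beautiful Soup rather than a regular expression.

```python
import csv
import io
import re

# Hypothetical HTML fragment standing in for a fetched contacts page;
# a real scraper would download this and usually parse it with Beautiful Soup.
HTML = """
<ul>
  <li><a href="mailto:alice@example.com">Alice</a></li>
  <li><a href="mailto:bob@example.org">Bob</a></li>
</ul>
"""

def scrape_emails(html):
    """Extract (name, email) pairs from mailto links by string matching."""
    pattern = re.compile(r'<a href="mailto:([^"]+)">([^<]+)</a>')
    return [(name, email) for email, name in pattern.findall(html)]

def to_csv(rows):
    """Serialize the scraped contacts to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["name", "email"])
    writer.writerows(rows)
    return buf.getvalue()

print(scrape_emails(HTML))  # [('Alice', 'alice@example.com'), ('Bob', 'bob@example.org')]
```

The same extract-then-convert pattern applies whichever parser is used; only the extraction step changes.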

Legal Issues[edit]

United States[edit]

In the United States, three legal claims are most commonly made in relation to web scraping: compilation copyright infringement, violation of the Computer Fraud and Abuse Act (CFAA), and electronic trespass to chattels. For example, users of "scraping tools" may be exposed to electronic trespass to chattels claims. [7] One well-known case is Intel Corp. v. Hamidi, in which the court held that the common law trespass to chattels claim did not extend to the computer context. [8][9] However, all three legal claims have shifted doctrinally, and it is uncertain whether they will persist in the future. [7][10] For instance, the applicability of the CFAA has been narrowed because of the technical similarities between web scraping and ordinary web browsing. [11] In EF Cultural Travel BV v. Zefer Corp., the court declined to apply the CFAA because EF failed to meet the standard for "damage". [12]

The EU[edit]

Under Article 14 of the EU’s General Data Protection Regulation (GDPR), data controllers are obligated to inform individuals before processing their personal data. [13] In the case of Bisnode vs. Polish Supervisory Authority, Bisnode obtained personal data from the government's public register of business activity and used the data for business purposes. However, Bisnode had email addresses for only some of the people concerned, so notifications were sent only to those individuals. Instead of directly informing the others, Bisnode simply posted a notice on its website, and it thus failed to comply with the GDPR’s Article 14 obligations. [14][15]

Australia[edit]

In Australia, address‑harvesting software and harvested‑address lists must not be supplied, acquired, or used under the Spam Act 2003. The Spam Act also requires all marketing emails to be sent with the consent of the recipients, and all emails must include an opt-out facility. [16] The company behind the GraysOnline shopping websites was fined after sending emails that breached the Spam Act. GraysOnline sent messages without an option for recipients to opt-out of receiving further emails, and it sent emails to people who had previously withdrawn their consent from receiving Grays' emails. [17][18]

China[edit]

Under the Cybersecurity Law of the People's Republic of China, web crawling of publicly available information is regarded as legal, but it would be illegal to obtain nonpublic, sensitive personal information without consent. [19] On November 24, 2017, three people were convicted of the crime of illegally scraping information system data stored on the server of Beijing ByteDance Networking Technology Co., Ltd. [20]

See also[edit]

References[edit]

Mary Jane 404 Peer review[edit]

This is where you will complete your peer review exercise. Please use the following template to fill out your review.

General info[edit]

Lead[edit]

Guiding questions:

  • Has the Lead been updated to reflect the new content added by your peer?
  • Does the Lead include an introductory sentence that concisely and clearly describes the article's topic?
  • Does the Lead include a brief description of the article's major sections?
  • Does the Lead include information that is not present in the article?
  • Is the Lead concise or is it overly detailed?

Lead evaluation[edit]

The lead is a bit disorganized but does include the topic covered.

Content[edit]

Guiding questions:

  • Is the content added relevant to the topic? Yes
  • Is the content added up-to-date? Yes
  • Is there content that is missing or content that does not belong? No
  • Does the article deal with one of Wikipedia's equity gaps? Does it address topics related to historically underrepresented populations or topics? No

Content evaluation[edit]

The article's content is relevant to the topic and up to date.

Tone and Balance[edit]

Guiding questions:

  • Is the content added neutral?
  • Are there any claims that appear heavily biased toward a particular position?
  • Are there viewpoints that are overrepresented, or underrepresented?
  • Does the content added attempt to persuade the reader in favor of one position or away from another?

Tone and balance evaluation[edit]

The content does not appear biased.

Sources and References[edit]

Guiding questions:

  • Is all new content backed up by a reliable secondary source of information?
  • Are the sources thorough - i.e. Do they reflect the available literature on the topic?
  • Are the sources current?
  • Are the sources written by a diverse spectrum of authors? Do they include historically marginalized individuals where possible?
  • Check a few links. Do they work?

Sources and references evaluation[edit]

The sources I saw were good but I couldn't see them listed at the bottom. I think you need a few more.

Organization[edit]

Guiding questions:

  • Is the content added well-written - i.e. Is it concise, clear, and easy to read? yes
  • Does the content added have any grammatical or spelling errors? No
  • Is the content added well-organized - i.e. broken down into sections that reflect the major points of the topic? Yes

Organization evaluation[edit]

The content is well written and easy to read. It's a bit underdeveloped, but the specific legal area it focuses on is explained well.

Images and Media[edit]

Guiding questions: If your peer added images or media

  • Does the article include images that enhance understanding of the topic? N/A
  • Are images well-captioned? N/A
  • Do all images adhere to Wikipedia's copyright regulations? N/A
  • Are the images laid out in a visually appealing way? N/A

Images and media evaluation[edit]

The article does not include images.

For New Articles Only[edit]

If the draft you're reviewing is a new article, consider the following in addition to the above.

  • Does the article meet Wikipedia's Notability requirements - i.e. Is the article supported by 2-3 reliable secondary sources independent of the subject?
  • How exhaustive is the list of sources? Does it accurately represent all available literature on the subject?
  • Does the article follow the patterns of other similar articles - i.e. contain any necessary infoboxes, section headings, and any other features contained within similar articles?
  • Does the article link to other articles so it is more discoverable?

New Article Evaluation[edit]

Overall impressions[edit]

Guiding questions:

  • Has the content added improved the overall quality of the article - i.e. Is the article more complete?
  • What are the strengths of the content added?
  • How can the content added be improved?

Overall evaluation[edit]

The article is easy to read but underdeveloped. The specific area covered is well written, but I think there are more angles to view it from besides legality.

Peer Review Draft 3 - Quackdon[edit]

General info[edit]

Lead[edit]

Guiding questions:

  • Has the Lead been updated to reflect the new content added by your peer?
  • Does the Lead include an introductory sentence that concisely and clearly describes the article's topic?
  • Does the Lead include a brief description of the article's major sections?
  • Does the Lead include information that is not present in the article?
  • Is the Lead concise or is it overly detailed?

Lead evaluation[edit]

The lead has been updated to reflect new content and includes information that is present in the article as well as the article's major sections. The lead is concise and includes an introductory sentence that concisely describes the article's topic.

Content[edit]

Guiding questions:

  • Is the content added relevant to the topic?
  • Is the content added up-to-date?
  • Is there content that is missing or content that does not belong?
  • Does the article deal with one of Wikipedia's equity gaps? Does it address topics related to historically underrepresented populations or topics?

Content evaluation[edit]

The content added is relevant to the topic and up to date. I recommend adding more information about the process of contact scraping. While it's definitely informative to tell the reader about the tools used to perform contact scraping, more information about the process may enhance the clarity of the topic. The article doesn't deal with equity gaps or historically underrepresented populations.

Tone and Balance[edit]

Guiding questions:

  • Is the content added neutral?
  • Are there any claims that appear heavily biased toward a particular position?
  • Are there viewpoints that are overrepresented, or underrepresented?
  • Does the content added attempt to persuade the reader in favor of one position or away from another?

Tone and balance evaluation[edit]

The content added is neutral and there isn't bias towards a particular position. No viewpoints are overrepresented or underrepresented.

Sources and References[edit]

Guiding questions:

  • Is all new content backed up by a reliable secondary source of information?
  • Are the sources thorough - i.e. Do they reflect the available literature on the topic?
  • Are the sources current?
  • Are the sources written by a diverse spectrum of authors? Do they include historically marginalized individuals where possible?
  • Check a few links. Do they work?

Sources and references evaluation[edit]

The content is backed up by reliable academic or peer-reviewed journals. The sources reflect the available literature on the topic, and their dates are current. The sources are written by a spectrum of authors, and the links are accessible.

Organization[edit]

Guiding questions:

  • Is the content added well-written - i.e. Is it concise, clear, and easy to read?
  • Does the content added have any grammatical or spelling errors?
  • Is the content added well-organized - i.e. broken down into sections that reflect the major points of the topic?

Organization evaluation[edit]

The content added is well written and enhances the clarity of the legal issues part. There aren't any grammatical or spelling errors observed. The content added is well organised and is broken down into sections: contact scraping, web scraping, and legal issues.

Images and Media[edit]

Guiding questions: If your peer added images or media

  • Does the article include images that enhance understanding of the topic?
  • Are images well-captioned?
  • Do all images adhere to Wikipedia's copyright regulations?
  • Are the images laid out in a visually appealing way?

Images and media evaluation[edit]

No images are added.

For New Articles Only[edit]

If the draft you're reviewing is a new article, consider the following in addition to the above.

  • Does the article meet Wikipedia's Notability requirements - i.e. Is the article supported by 2-3 reliable secondary sources independent of the subject?
  • How exhaustive is the list of sources? Does it accurately represent all available literature on the subject?
  • Does the article follow the patterns of other similar articles - i.e. contain any necessary infoboxes, section headings, and any other features contained within similar articles?
  • Does the article link to other articles so it is more discoverable?

New Article Evaluation[edit]

Overall impressions[edit]

Guiding questions:

  • Has the content added improved the overall quality of the article - i.e. Is the article more complete?
  • What are the strengths of the content added?
  • How can the content added be improved?

Overall evaluation[edit]

The content added has improved the reader's understanding of how contact scraping relates to legal issues and it is more complete. As mentioned, I think the article can be improved by disclosing the process of contact scraping that could hopefully complement my understanding of the tools used in contact scraping.

Evaluate an article (1)[edit]

This is where you will complete your article evaluation. Please use the template below to evaluate your selected article.

Lead[edit]

Guiding questions
  • Does the Lead include an introductory sentence that concisely and clearly describes the article's topic?
  • Does the Lead include a brief description of the article's major sections?
  • Does the Lead include information that is not present in the article?
  • Is the Lead concise or is it overly detailed?

Lead evaluation[edit]

The lead includes an introductory sentence that concisely and clearly describes the article's topic. Overall, the lead provides an insightful introduction to the article's topic. The lead could be improved by introducing the article's major sections.

Content[edit]

Guiding questions
  • Is the article's content relevant to the topic?
  • Is the content up-to-date?
  • Is there content that is missing or content that does not belong?
  • Does the article deal with one of Wikipedia's equity gaps? Does it address topics related to historically underrepresented populations or topics?

Content evaluation[edit]

The article's content is relevant to the topic and relatively comprehensive and up to date. However, the content may contain equity gaps; for instance, people looking for information privacy on art would not be able to find relevant content.

Tone and Balance[edit]

Guiding questions
  • Is the article neutral?
  • Are there any claims that appear heavily biased toward a particular position?
  • Are there viewpoints that are overrepresented, or underrepresented?
  • Does the article attempt to persuade the reader in favor of one position or away from another?

Tone and balance evaluation[edit]

The article has a neutral tone as there is no claim that appears heavily biased toward a particular position. Most of the content is based on fact instead of viewpoints.

Sources and References[edit]

Guiding questions
  • Are all facts in the article backed up by a reliable secondary source of information?
  • Are the sources thorough - i.e. Do they reflect the available literature on the topic?
  • Are the sources current?
  • Are the sources written by a diverse spectrum of authors? Do they include historically marginalized individuals where possible?
  • Check a few links. Do they work?

Sources and references evaluation[edit]

All facts in the article are backed up by a reliable and thorough secondary source of information, and the links to resources do work. The sources are also current, as the article includes sources from recent years.

Organization[edit]

Guiding questions
  • Is the article well-written - i.e. Is it concise, clear, and easy to read?
  • Does the article have any grammatical or spelling errors?
  • Is the article well-organized - i.e. broken down into sections that reflect the major points of the topic?

Organization evaluation[edit]

The article is well-written as it is concise, clear, and easy to read. The article is well-organized and does not have any grammatical or spelling errors.

Images and Media[edit]

Guiding questions
  • Does the article include images that enhance understanding of the topic?
  • Are images well-captioned?
  • Do all images adhere to Wikipedia's copyright regulations?
  • Are the images laid out in a visually appealing way?

Images and media evaluation[edit]

The article does not contain any images.

Checking the talk page[edit]

Guiding questions
  • What kinds of conversations, if any, are going on behind the scenes about how to represent this topic?
  • How is the article rated? Is it a part of any WikiProjects?
  • How does the way Wikipedia discusses this topic differ from the way we've talked about it in class?

Talk page evaluation[edit]

There are academic conversations going on behind the scenes about how to represent this topic. The article is rated as solid. The way Wikipedia discusses this topic differs in that the conversations are more formal.

Overall impressions[edit]

Guiding questions
  • What is the article's overall status?
  • What are the article's strengths?
  • How can the article be improved?
  • How would you assess the article's completeness - i.e. Is the article well-developed? Is it underdeveloped or poorly developed?

Overall evaluation[edit]

The article's overall status is relatively solid, as it is well-developed. The strengths are that the content is clear and concise, and the resources cited are reliable. The article can be improved by covering broader topics under the section "Information types".


Evaluate an article (2)[edit]

This is where you will complete your article evaluation. Please use the template below to evaluate your selected article.

  • Name of article: Bank secrecy
  • I have chosen this article to evaluate because I'm interested in the industry of finance.

Lead[edit]

Guiding questions
  • Does the Lead include an introductory sentence that concisely and clearly describes the article's topic?
  • Does the Lead include a brief description of the article's major sections?
  • Does the Lead include information that is not present in the article?
  • Is the Lead concise or is it overly detailed?

Lead evaluation[edit]

The lead includes an introductory sentence that concisely and clearly describes the article's topic. However, the lead does not include a brief description of the article's major sections, and the information in the lead is too detailed. Overall, the lead provides an insightful introduction to the article's topic, but it needs to be more concise.

Content[edit]

Guiding questions
  • Is the article's content relevant to the topic?
  • Is the content up-to-date?
  • Is there content that is missing or content that does not belong?
  • Does the article deal with one of Wikipedia's equity gaps? Does it address topics related to historically underrepresented populations or topics?

Content evaluation[edit]

The article's content is relevant to the topic, but it mainly contains links to other Wikipedia articles. The content may contain equity gaps; for instance, people looking for banking privacy in Asia would not be able to find relevant content.

Tone and Balance[edit]

Guiding questions
  • Is the article neutral?
  • Are there any claims that appear heavily biased toward a particular position?
  • Are there viewpoints that are overrepresented, or underrepresented?
  • Does the article attempt to persuade the reader in favor of one position or away from another?

Tone and balance evaluation[edit]

The article has a neutral tone as there is no claim that appears heavily biased toward a particular position. Most of the content is based on fact instead of viewpoints.

Sources and References[edit]

Guiding questions
  • Are all facts in the article backed up by a reliable secondary source of information?
  • Are the sources thorough - i.e. Do they reflect the available literature on the topic?
  • Are the sources current?
  • Are the sources written by a diverse spectrum of authors? Do they include historically marginalized individuals where possible?
  • Check a few links. Do they work?

Sources and references evaluation[edit]

All facts in the article are backed up by a reliable and thorough secondary source of information, and the links to resources do work. The sources are also current, as the article includes sources from recent years.

Organization[edit]

Guiding questions
  • Is the article well-written - i.e. Is it concise, clear, and easy to read?
  • Does the article have any grammatical or spelling errors?
  • Is the article well-organized - i.e. broken down into sections that reflect the major points of the topic?

Organization evaluation[edit]

The article is easy to read but it has very limited content. The article does not have any grammatical or spelling errors.

Images and Media[edit]

Guiding questions
  • Does the article include images that enhance understanding of the topic?
  • Are images well-captioned?
  • Do all images adhere to Wikipedia's copyright regulations?
  • Are the images laid out in a visually appealing way?

Images and media evaluation[edit]

The article includes two images which are well-captioned and adhere to Wikipedia's copyright regulations.

Checking the talk page[edit]

Guiding questions
  • What kinds of conversations, if any, are going on behind the scenes about how to represent this topic?
  • How is the article rated? Is it a part of any WikiProjects?
  • How does the way Wikipedia discusses this topic differ from the way we've talked about it in class?

Talk page evaluation[edit]

There are academic conversations going on behind the scenes about how to represent this topic. There are some controversies regarding the content of the article. The way Wikipedia discusses this topic differs in that the conversations are more formal, and some sentences are not easy to understand.

Overall impressions[edit]

Guiding questions
  • What is the article's overall status?
  • What are the article's strengths?
  • How can the article be improved?
  • How would you assess the article's completeness - i.e. Is the article well-developed? Is it underdeveloped or poorly developed?

Overall evaluation[edit]

The article's overall status is relatively solid, as it is concise and the resources cited are reliable. The article can be improved by providing a brief introduction above the link in each section of content. The article may also address its equity gap by including a section on banking privacy in Asian countries.


Privacy+ Topic: Contact scraping (outline and Source)[edit]

In its current form, the article "Contact Scraping" contains only a section defining contact scraping. I plan to add sections about the history of contact scraping, techniques for contact scraping, and the legal issues of contact scraping. I still need to do more research to find out whether I will be able to add more sections.


Section 1: History

- So far I haven't found any academic journal articles related to this topic. I may need to refer to the history section of the article "Web Scraping".


Section 2: Techniques for contact scraping

- Most of the academic journal articles I've found so far are related to this topic. However, I need to classify the techniques I found (i.e. HTML/DOM parsing) and translate them into less technical language.


Section 3: Legal issues of contact scraping

- I have found a few law journals related to this topic, and I got permission to use them as long as they don't count as my . However, I still need some time to read those law journals.


The list of Wikipedia articles I plan to link to:

  1. Web Scraping
  2. Data Scraping


Below is a list of peer reviewed journal articles that I plan to cite.

Topic: Contact scraping

1. A Novel Web Scraping Approach Using the Additional Information Obtained From Web Pages

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9051800

Uzun, E. 2020. “A Novel Web Scraping Approach Using the Additional Information Obtained From Web Pages.” IEEE Access 8 (1): 61726–40.

This is a journal article which focuses on a new approach to data scraping named UzunExt, which extracts content quickly using string methods and additional information, without creating a DOM tree. The article first acknowledges that most previous studies on web scraping are about automatically extracting web content while ignoring time efficiency. The article then introduces the novel approach, UzunExt. The approach achieves time efficiency by employing string methods and a crawling process, i.e. a process for collecting additional information, including the starting position for enhancing the searching process and the number of inner tags for improving the extraction process. The average extraction time of the simple method in this approach is approximately 60 times better than the average extraction time of the AngleSharp parser. At the end, the researchers state that they plan to develop an effective and efficient web scraper that can create datasets automatically. The target audience for this article is scholars who are interested in web scraping and in methods to improve the efficiency of data scraping tools, especially people who are studying both automatic and manual methods to determine extraction patterns. In terms of reading difficulty, the article has a moderate level of difficulty for people who are familiar with web scraping and its tools, while it is difficult for others to understand the terms used in the article. The article is relatively biased, since it only lists the advantages of the new approach to data scraping without its disadvantages compared to traditional data scraping tools. I learned about extraction patterns through the data provided by the article, such as the appropriate element for efficient extraction. Meanwhile, the article introduced me to a new method of data scraping that may be helpful in practice by saving time during data extraction.
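The core idea described above, extracting an element's content with plain string methods instead of building a DOM tree, can be sketched roughly as follows. This is an illustrative approximation, not the authors' implementation; the tag strings and the optional position hint (standing in for the paper's "starting position" information) are assumptions.

```python
# Illustrative approximation of string-based extraction without a DOM tree:
# find the element's start tag with str.find(), slice out the content, and
# reuse a remembered starting position as a hint on structurally similar pages.
# The tag strings and the hint mechanism are assumptions, not the authors' code.

def extract(html, start_tag, end_tag, hint=0):
    """Return (content, tag_position) located purely by string search."""
    i = html.find(start_tag, hint)
    if i == -1 and hint:
        i = html.find(start_tag)  # hint missed: fall back to a full scan
    if i == -1:
        return None, 0
    begin = i + len(start_tag)
    return html[begin:html.find(end_tag, begin)], i

page = "<html><body><div class='c'>hello</div></body></html>"
content, pos = extract(page, "<div class='c'>", "</div>")
print(content)  # hello
```

The returned position can be fed back in as the hint for the next structurally similar page, which is the kind of crawl-time information the approach exploits to skip redundant searching.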


2. Monitoring e-commerce adoption from online data

https://libproxy.berkeley.edu/login?auth=shib&qurl=https%3A%2F%2Flink.springer.com%2Fcontent%2Fpdf%2F10.1007%2Fs10115-018-1233-7.pdf

Blazquez, D., Josep D., Jose G., et al. 2019. “Monitoring E-Commerce Adoption from Online Data.” Knowledge and Information Systems: An International Journal 60 (1): 227-45.

This is a journal article which focuses on an intelligent system, the System for Automatically Monitoring E-commerce Adoption (SAME), which automatically monitors firms’ engagement in e-commerce. The article begins by introducing the context of web scraping and existing big data learning methods, which mainly focus on relating the popularity of keyword searches (generally obtained from Google Trends reports). The article then introduces SAME, the intelligent system developed for automatically detecting and monitoring e-commerce availability. The article also describes the way the system has been tested and evaluated. The evaluation of the system was performed by applying the predictive model to 426 corporate websites of manufacturing companies based in France and Spain. At the end, the researchers state that they plan to improve SAME to increase the level of detail in the provided output. The target audience is scholars who are interested in methods for monitoring e-commerce adoption or in the application of web scraping tools. In terms of reading difficulty, the article has a moderate level of difficulty for people who are familiar with web scraping and its tools, while it is difficult for others to understand the terms used in the article. The article is relatively biased, since it only highlights the benefits of using intelligent systems without acknowledging the system's limitations: SAME makes it possible to track the evolution of this activity in real time and is thus capable of offering more frequent information related to e-commerce than official surveys, and information about e-commerce adoption can now be discovered directly from the Web without specifically asking firms. The article introduces me to a new tool to retrieve and analyze data from corporate websites to discover the adoption of e-commerce, which may be helpful in the future when I need to automatically retrieve data using scraping tools like SAME.


3. Increasing online shop revenues with web scraping: a case study for the wine sector

https://repositori.udl.cat/bitstream/handle/10459.1/68768/030083.pdf;jsessionid=6E8299B57DA6DE6258E47E1E5DCAE0CC?sequence=1

Jorge, O., Adria P., Josep R., et al. 2020. “Increasing Online Shop Revenues with Web Scraping: A Case Study for the Wine Sector.” British Food Journal 1 (1):1-19.

This is a journal article which focuses on using web scraping to increase revenues for the company “QuieroVinos, S.L.”, an online wine shop founded in 2015 that sells Spanish wines in two main marketplaces. The article begins by introducing the historical development of the wine industry and e-commerce, as well as the role of pricing in the marketing mix for e-commerce. The article then discusses the application of web scraping to track competitors' prices for a set of products. The application mainly uses a tool called Selenium to access each marketplace's BackOffice. The application could be used by specialized online stores which sell their products through marketplaces such as Uvinum and Vivino. Since the deployment of the application, sales through the marketplaces have increased by 104% on average. At the end, the researchers state that they plan to study other alternatives, such as making the requests through proxy servers. The target audience is scholars who are interested in the application of web scraping in the real business world, since this article mainly discusses how the web scraping tool Selenium helps businesses increase revenue. In terms of reading difficulty, the first part of the article, which introduces the background of the wine industry and e-commerce, is easy to understand; the second part, which describes the algorithms Selenium uses to access the BackOffice, has a higher level of reading difficulty for people who are not familiar with web scraping and its tools. The article is relatively objective, since it not only highlights the increased revenue produced by the application of web scraping, but also acknowledges the limitation faced by the application, namely the high levels of protection of marketplaces. The article introduces me to Selenium, a new tool which can be used to increase sales through marketplaces by collecting information about competitors' prices for the products sold on each corresponding marketplace.
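The article does not publish QuieroVinos' repricing rule, so the following undercut-with-floor policy is purely an illustrative assumption of what the decision step after scraping competitors' prices might look like; the Selenium fetching itself is omitted, and all prices are hypothetical.

```python
# Hypothetical repricing step that could follow competitor-price scraping:
# match the cheapest competitor minus a small undercut, never below a floor.
# This rule is an assumption for illustration, not the paper's algorithm.

def reprice(own_price, competitor_prices, floor, undercut=0.01):
    """Price just below the cheapest competitor, but never below the floor."""
    if not competitor_prices:
        return own_price  # no market signal: keep the current price
    target = min(competitor_prices) - undercut
    return round(max(target, floor), 2)

print(reprice(12.50, [11.99, 13.40, 12.10], floor=10.00))  # 11.98
```

In a full pipeline, the competitor price list would come from the scraper and the floor from the product's cost and margin constraints.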


4. A primer on theory-driven web scraping: Automatic extraction of big data from the Internet for use in psychological research

https://www-proquest-com.libproxy.berkeley.edu/docview/1790926611?accountid=14496

Landers, R., Robert B., Katelyn C., et al. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 21 (4): 475–92.

This is a journal article which focuses on the application of web scraping in psychology. The first part of article points out that web scraping offers more potential for psychology by increasing access to behavioral data without significant researcher intrusion, dramatically increasing sample sizes, decreasing the amount of time spent during the data collection phase, increasing access to researchers in newly industrialized and undeveloped nations who cannot afford traditional large-scale research projects, and improving the interdisciplinary application of our vast research literature on psychometrics. The second part of the article acknowledges that the application of web scraping in psychology may lead to invalid data sources. Thus, researchers need to carefully consider what theory can be tested, which results can be replicated meaningfully, and where theory can be reasonably expanded. The article concludes the greatest benefits from web scraping will be realized when combining the behavioral data to which it provides access with other data sources already commonly in use. The target audience are scholars who are studying the application of web scraping in experiment or are interested in doing research in psychology. In terms of reading difficulty, the article has a moderate level of difficulty for people who are familiar with statistical methods for experiments; meanwhile, the article may be difficult for others to understand since the article includes specific terms such as confidence interval. The article is relatively objective since it not only states the benefits of the application of data scraping in psychology as mentioned above, but also discusses the problem of the validity of data source caused by the application of data scraping in psychology. The article introduces me to exciting possibilities for psychology offered by theory-driven web scraping. 
I also learned that web scraping may make the dataset used for an experiment invalid, so I need to be aware of this problem if I apply web scraping in future experiments.


5. Strategies to access web-enabled urban spatial data for socioeconomic research using R functions

https://libproxy.berkeley.edu/login?auth=shib&qurl=https%3A%2F%2Flink.springer.com%2Fcontent%2Fpdf%2F10.1007%2Fs10109-019-00309-y.pdf

Vallone, A., Coro C., and Beatriz S. 2020. “Strategies to Access Web-Enabled Urban Spatial Data for Socioeconomic Research Using R Functions.” Journal of Geographical Systems: Spatial Theory, Models, Methods, and Data 22 (2): 217–34.

This is a journal article which focuses on data extraction strategies (R functions) used to generate databases for Spain at the municipality level. The article begins by introducing the most common difficulties in working with secondary Internet-enabled data, which fall into two categories: accessibility and availability problems. The researchers then use a URL parsing strategy to solve the accessibility problems presented by official agency web portals, building a set of R functions, parquet.aut(), to download, load, and manipulate population and unemployment databases. They also deal with availability problems in the construction of the rm database, applying a web scraping strategy with the functions data.rm() and data.rm.a() to download the information on firms and freelancers freely published by a private company. The researchers conclude that the rm database is very helpful for understanding the distribution of economic activities in Spain at the urban and individual levels. The target audience is scholars who study geographical scales using web tools or who are trying to solve data availability and accessibility problems. In terms of reading difficulty, the article is moderately difficult for people familiar with spatial analysis and its tools as well as the R language and its functions; others may struggle with terms such as the arguments and outputs of the R functions. The article is relatively biased since it only highlights that R functions can solve the accessibility and availability problems, while it fails to acknowledge the analytical limitations of those functions or to indicate future improvements to them.
The article introduces me to R functions which are useful for socioeconomic research, since they solve the accessibility and availability problems of datasets.
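The paper implements its scraping strategy as R functions (data.rm(), data.rm.a()); a minimal sketch of the same idea in Python's standard library is shown below: fetch a page, parse its markup, and tabulate the values. The sample HTML, municipality names, and figures are invented for illustration, not taken from the paper.

```python
# Illustrative sketch only: fetch-and-tabulate scraping in the spirit of the
# paper's R functions, using Python's built-in HTML parser. In real use the
# page would come from urllib.request; here it is an inline sample.
from html.parser import HTMLParser

SAMPLE_PAGE = """
<table>
  <tr><td>Madrid</td><td>3266126</td></tr>
  <tr><td>Barcelona</td><td>1636762</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collect (municipality, population) pairs from <td> cells."""
    def __init__(self):
        super().__init__()
        self.cells, self.rows, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self.cells:
            self.rows.append((self.cells[0], int(self.cells[1])))
            self.cells = []

    def handle_data(self, data):
        if self._in_td and data.strip():
            self.cells.append(data.strip())

scraper = TableScraper()
scraper.feed(SAMPLE_PAGE)
print(scraper.rows)  # [('Madrid', 3266126), ('Barcelona', 1636762)]
```

The same fetch-parse-tabulate loop is what the paper's R functions automate for each municipality-level data source.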


6. Gather-Narrow-Extract: A Framework for Studying Local Policy Variation Using Web-Scraping and Natural Language Processing

https://www.tandfonline.com/doi/full/10.1080/19345747.2019.1654576

Anglin, K. 2019. “Gather-Narrow-Extract: A Framework for Studying Local Policy Variation Using Web-Scraping and Natural Language Processing.” Journal of Research on Educational Effectiveness 12 (4): 685–706.

This is a journal article which focuses on the gather-narrow-extract framework, a framework used to study local policy variation. The article begins by providing context on local policy: only limited information is available in administrative datasets that capture the decisions districts make in response to an opportunity or mandate. The researchers then introduce the gather-narrow-extract framework, which gives researchers a template for automatically extracting structured information from student and staff manuals as well as other text documents located on the internet. The researchers state that gather-narrow-extract brings many potential advantages. Web scraping and Natural Language Processing (NLP) within the framework drastically reduce the time from research question to data-in-hand, increasing the speed at which researchers can answer timely policy questions. In addition, gather-narrow-extract increases the replicability of research, since either the original researcher or colleagues can simply rerun the original scripts to update or confirm analyses. The target audience is scholars who are interested in the application of web scraping and natural language processing, as well as people who want to learn a new framework for data extraction. In terms of reading difficulty, the article has a moderate level of difficulty, since it does not use specialist computer-science terms and clearly defines each term it introduces. The article is relatively objective since it not only highlights the advantages of the gather-narrow-extract framework, but also acknowledges its disadvantages, including high expense and difficulty in collecting observational data. The article introduces me to the gather-narrow-extract framework.
The framework would help me conduct experiments in the future: validity and statistical power are increased with minimal resources, as data collection and analysis can now be scaled to entire populations of interest.
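The three steps the article names can be sketched in miniature. This is a hedged illustration only: the sample district documents, the keyword, and the regular expression are all invented, and a real application would gather documents with a crawler and extract fields with NLP models rather than a single regex.

```python
# Minimal gather-narrow-extract sketch on invented sample documents.
import re

# Gather: in practice, crawled district manuals; here, inline samples.
documents = {
    "district_a": "Handbook 2019. Suspensions may not exceed 10 days per incident.",
    "district_b": "Staff manual. Dress code applies on field trips.",
    "district_c": "Code of conduct: suspensions are capped at 5 days.",
}

# Narrow: keep only documents that mention the policy of interest.
relevant = {k: v for k, v in documents.items() if "suspension" in v.lower()}

# Extract: pull the numeric cap into a structured record per district.
policy = {
    district: int(m.group(1))
    for district, text in relevant.items()
    if (m := re.search(r"(\d+)\s*days", text))
}
print(policy)  # {'district_a': 10, 'district_c': 5}
```

The replicability advantage the article emphasizes follows directly: rerunning this script on a refreshed document set updates the policy table with no manual coding.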


7. Web scraping techniques to collect data on consumer electronics and airfares for Italian HICP compilation

https://eds-a-ebscohost-com.libproxy.berkeley.edu/eds/pdfviewer/pdfviewer?vid=1&sid=795ad387-abf8-4ab7-bb18-3ef0b06e4c02%40sdc-v-sessmgr02

Polidoro, F., Giannini, R., Lo Conte, R., et al. 2015. “Web Scraping Techniques to Collect Data on Consumer Electronics and Airfares for Italian HICP Compilation.” Statistical Journal of the IAOS 31 (2): 165–86.

This is a journal article which focuses on the results of testing web scraping techniques in the field of consumer price surveys. The article begins by introducing the three main features of the modernization of data collection: widening the use of electronic devices, accessing scanner data as a source for inflation estimates, and enlarging the adoption of web scraping techniques. The Italian National Statistical Institute (Istat) has carried out tests of these three features within the current Italian consumer price survey. In the tests, the adoption of web scraping techniques saved more than 30% of the time spent managing the survey results. The researchers conclude that using web scraping techniques exclusively can make the current survey more time-efficient without expanding the data collection. Improving and maintaining specific macros to extend web scraping to the entire web data collection for airfares could enhance quality and free Istat data collectors for other activities, even if only for a limited number of hours. The target audience is researchers who study the efficiency of web scraping techniques. In terms of reading difficulty, the article is relatively difficult since the tone is quite academic and most sentences are long. The article is relatively objective since it not only highlights the benefits of using web scraping techniques, but also acknowledges their disadvantages, including coverage error and sampling error. I learned that the adoption of web scraping techniques would allow us to collect price information in the field more efficiently. I agree with the researchers that the tests can be extended in order to also scrape information concerning the characteristics (brands, varieties) of the elementary items for which prices are collected.


8. Web scraping based online consumer price index: The "IPC Online" case.

https://eds-a-ebscohost-com.libproxy.berkeley.edu/eds/pdfviewer/pdfviewer?vid=1&sid=7dcd63df-b259-40c3-b2b2-31f119061462%40sessionmgr4008

Uriarte, J., Gonzalo R., and Juan L. 2019. “Web Scraping Based Online Consumer Price Index: The ‘IPC Online’ Case.” Journal of Economic & Social Measurement 44 (2/3): 141–59.

This is a journal article which focuses on the application of web scraping to an online consumer price index. The article presents a successful case of entrepreneurship based on privately web-scraped data: working with a minimal team, the authors managed to replicate and adapt consumer price and construction cost indexes for a mid-sized urban area in Argentina. The results support the conclusion that, compared with institutions that estimate inflation with standard procedures (hand-picking prices), the process of online data collection is remarkably efficient, though it still faces tradeoffs. The researchers conclude that the final outcome in the case of IPC Online does not deviate much from the more complex and costly process of obtaining data: inflation rates are remarkably similar even with higher frequency, more data, and fewer personnel. The target audience is scholars who study the application of web scraping or who are interested in market analysis. In terms of reading difficulty, the article is moderately difficult for people familiar with statistical tools, while it may be hard for others to understand, since it contains terms such as density function. The article is relatively objective since it not only highlights the benefits of price indexes based on web-scraped data, but also acknowledges the problems they cause: the lack of regularity in the presence of product offers in the web price list, and the more intrinsic question of whether the captured price is what a consumer actually faces when going to a supermarket. I agree with the researchers that many of the uses discussed in the recent and growing literature on indices based on web-scraped data reflect optimism about incorporating this information into traditional indices as complementary data.


9. Scraping social media data for disaster communication: how the pattern of Twitter users affects disasters in Asia and the Pacific

https://ucelinks-cdlib-org.libproxy.berkeley.edu/sfx_local/cgi/core/sfxresolver.cgi?tmp_ctx_obj_id=1&service_id=110974982655298&tmp_ctx_svc_id=1&request_id=114749824

Kusumasari, B., and Prabowo N. 2020. “Scraping Social Media Data for Disaster Communication: How the Pattern of Twitter Users Affects Disasters in Asia and the Pacific.” Natural Hazards 103 (3): 3415–35.

This is a journal article which focuses on the role of social media data in disaster communication. The article begins by stating that in the traditional disaster management model, information flows from emergencies to the public. The researchers then describe patterns of social media usage by communities directly and indirectly affected by disasters in the Asia–Pacific region. The study also reviews previous research on the potential of social media during disasters; examples of this potential include geographical mapping of disaster-affected areas, relief efforts and health improvement for affected communities, and serving as a source of disaster-related information. Nevertheless, the study identifies obstacles that may hinder the realization of such potential. For instance, countries with strict social media censorship generate few local tweets from disaster areas. Aside from communication channels being cut off, the lack of tweets from disaster-affected areas may be associated with government policies concerning social media. The target audience is people who are interested in the application of social media data. In terms of reading difficulty, the article is moderately difficult, since it does not contain specialist terms from computer science or statistics. The article is relatively objective since it acknowledges the study's limitations: the numerous disasters examined in this research inhibited in-depth analysis of particular disasters, and the study did not consider the causal factors behind disasters having a unique tweet composition. For instance, the usage categories of government criticism and request for help had a small portion of tweets, which may be due to strict social media censorship applied by the government. I agree that realizing the potential of social media to map out disaster-affected areas would be difficult due to the lack of tweets in the Asia–Pacific region.


10. Data Is the New Oil--But How Do We Drill It? Pathways to Access and Acquire Large Data Sets in Communication Science

https://eds-a-ebscohost-com.libproxy.berkeley.edu/eds/detail/detail?vid=0&sid=9e37ebdc-72e7-4cc3-98fb-79817016e178%40sdc-v-sessmgr02&bdata=JnNpdGU9ZWRzLWxpdmU%3d#AN=edsgcl.610367737&db=edsger

Possler, D., Bruns S., and Niemann-Lenz J. 2019. “Data Is the New Oil--But How Do We Drill It? Pathways to Access and Acquire Large Data Sets in Communication Science.” International Journal of Communication (Online), 1 (1): 3894–911.

This is a journal article which focuses on how communication science could overcome the challenges of accessing large data sets in the future. The article begins by pointing out that it is crucial to improve options for acquiring large communication data resources. The researchers take a step in this regard by systematizing the current data acquisition options and outlining their specific barriers. One issue raised rather often in the literature is that providers, publishers, and producers of online services are not (legally) obliged to grant scientific access to their resources. A second factor that affects data quality is restrictions on the completeness and structure of the data output; the output of web crawling lacks this additional information, so intensive data cleansing is needed. The researchers then conduct interviews on people's perceptions of data tools and conclude that technical training should be improved at all career levels to equip scholars with sufficient methodological knowledge. The target audience is researchers who are looking for methods to access large data sets. In terms of reading difficulty, the first part of the article, which discusses cooperating with firms to gain data or buying data directly from data owners, is easy to understand; the second part, which discusses application programming interfaces (APIs) and data scraping, is harder for people who are not familiar with APIs and web scraping tools. The article is relatively objective since it acknowledges the study's limitations, even though the study undertook several measures to enhance sample diversity, such as recruiting males and females from diverse countries and at various scientific career stages. I agree that great power comes with great responsibility regarding ethical principles, so data owners should provide researchers with a legal way to access data when they sell data to researchers.


11. A Semi-Automatic Data-Scraping Method for the Public Transport Domain

https://eds-a-ebscohost-com.libproxy.berkeley.edu/eds/detail/detail?vid=0&sid=42d1e59e-661e-495c-acd7-d07bba705ec1%40sdc-v-sessmgr03&bdata=JnNpdGU9ZWRzLWxpdmU%3d#AN=000481972100020&db=edswsc

Vela, B., Cavero, J., Caceres P., et al. 2019. “A Semi-Automatic Data-Scraping Method for the Public Transport Domain.” IEEE 7 (1): 1027–56.

This journal article is from the Institute of Electrical and Electronics Engineers (IEEE). The article focuses on extracting and processing the existing information on the web concerning public transport and its accessibility, in order to generate an open data repository in which to store this information. The article begins by stating the importance of transport data, since people have to move around every day to perform daily tasks; however, there is an important lack of information regarding the accessibility of routes and sites. The researchers then introduce a method that uses Python to semi-automatically generate a data scraper for the public transport domain. This method allows the extraction of public transport data and the existing accessibility information from a selected website. The researchers conclude that future work is to semi-automatically generate a scraper for the metros of all the systems analyzed in order to complete the repository. The target audience is people who study web scraping techniques or who are interested in real-world applications of web scraping. In terms of reading difficulty, the first part of the article, which introduces the background of transport data, is easy to understand; the second part, which describes the Python code in detail, is harder for people who are not familiar with Python. The article is relatively biased since it only highlights that the Python-based method can solve the accessibility problems, while it fails to acknowledge the method's limitations in data scraping. The article introduces me to a new method of data extraction that can solve the accessibility problems of data scraping. I may include this method as an alternative to contact scraping tools in my Wikipedia article.
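The paper's method generates a Python scraper tailored to a chosen transport website; the sketch below is a much-simplified illustration of that idea. The markup, station names, and the "accessible" class are invented for demonstration, since the real scraper is adapted to each site's actual HTML.

```python
# Hedged sketch: extract station records with an accessibility flag from
# sample transport-site markup, emitting JSON for an open data repository.
import json
import re

SAMPLE_HTML = """
<li class="station accessible">Sol</li>
<li class="station">Tribunal</li>
<li class="station accessible">Chamartin</li>
"""

def scrape_stations(html):
    """Return one record per station with its accessibility flag."""
    pattern = re.compile(r'<li class="station( accessible)?">([^<]+)</li>')
    return [
        {"station": name, "accessible": bool(flag)}
        for flag, name in pattern.findall(html)
    ]

records = scrape_stations(SAMPLE_HTML)
print(json.dumps(records, indent=2))
```

Running the sketch over the sample markup yields three records, two flagged accessible, which is the kind of structured accessibility information the paper's repository stores.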


12. Scraping the demos. Digitalization, web scraping and the democratic project

https://www-tandfonline-com.libproxy.berkeley.edu/doi/pdf/10.1080/13510347.2020.1714595?needAccess=true

Ulbricht, L. 2020. “Scraping the demos. Digitalization, web scraping and the democratic project.” Democratization 27 (3): 426–42.

This journal article is from Democratization. The article focuses on the democratic implications of demos scraping, whose advocates claim it reduces the gap between political elites and citizens. The article begins by introducing demos scraping, which its proponents promote as a new and better tool for knowing the citizenry, meeting citizens' expectations, and providing political legitimacy. However, the article then argues that demos scraping is a danger to many forms of political participation. Since there is, furthermore, little to no control over large technology companies, demos scraping creates vast possibilities for the abuse of citizen data by companies and governments. The researchers acknowledge the paper's limitation, which is that it mainly focuses on discourse rather than practice. The researchers conclude that a ban on demos scraping is not necessary, but that it should be handled with care; for example, the algorithms used in demos scraping should be made more transparent. The target audience is people who are interested in politics or who would like to learn more about demos scraping: both its strengths for knowing the citizenry better and its shortcomings in harboring many biases. In terms of reading difficulty, the article is moderately difficult, since it does not contain technical terms such as those related to coding. The article is relatively objective: it states the researchers' conclusion that there is clearly no reason to assume digital tools will solve any of the problems related to insufficient political participation, biased representation, and imperfect responsiveness, so people should beware of technological quick fixes to long-standing democratic deficits and democratic fatigue; meanwhile, the researchers acknowledge the paper's limitation as mentioned above.
The article introduces me to the new concept of demos scraping, which re-defines representation and participation by challenging the concepts of the autonomous subject and the mature citizen and by replacing citizens with consumers and individuals with “data doubles”. This article may not be helpful for my Wikipedia article since it is about demos scraping, which is not relevant to my topic of contact scraping.


13. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining.

https://ebookcentral-proquest-com.libproxy.berkeley.edu/lib/berkeley-ebooks/detail.action?docID=1824310#goto_toc

Selig, K. 2017. “Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining.” Biometrics 73 (4): 1469–507.

This journal article is from Biometrics. The article focuses on data collection on the Web with R, and it provides an overview of three areas. The first area is the technologies that allow the distribution of content on the Web, which include XML/HTML, AJAX, and JSON. The second area is the technologies needed to retrieve information from those files, which include JSON parsers, Selenium, and regular expressions. Most of the article is about the third area: the technologies for storing collected web data. The researchers point out that, regarding automated data collection, databases are of interest for two reasons: firstly, one might occasionally be granted access to a database directly and should be able to cope with it; secondly, although R has a lot of data management facilities, it might be preferable to store data in a database rather than in one of the native formats. The target audience is people who are interested in web scraping techniques, or people who already have some text data and need to extract information from it. In terms of reading difficulty, the article is harder for people who are not familiar with web scraping tools or R functions, since it contains many sections of R code that require previous exposure to R. The article is relatively objective since it not only lists the benefits of the technologies, but also states their limitations. For example, the researchers point out that in many instances the ordinary data storage facilities of R suffice, by importing and exporting data in binary or plain text formats. The article introduces me to the concept that R is capable of interpreting HTML, that is, it knows how tables, headlines, and other objects are structured in the file format.
This article may not be helpful for my Wikipedia article since it is about web scraping and text mining, which would be too complicated for my topic of contact scraping.
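The book's point that scraped data is often better kept in a database than in R's native formats can be sketched in Python with the standard library's sqlite3 module. The table name, fields, and sample rows are illustrative, not from the book.

```python
# Hedged sketch: store scraped records in SQLite so later analyses can
# query the store instead of re-scraping. Sample data is invented.
import sqlite3

scraped = [("2020-01-01", "widget", 9.99), ("2020-01-02", "widget", 10.49)]

conn = sqlite3.connect(":memory:")  # a file path would persist the data
conn.execute("CREATE TABLE prices (day TEXT, item TEXT, price REAL)")
conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", scraped)

# Later analyses run SQL over the accumulated store.
avg = conn.execute("SELECT AVG(price) FROM prices").fetchone()[0]
print(round(avg, 2))  # 10.24
```

The design choice mirrors the book's reasoning: a database decouples collection from analysis and copes with data that outgrows in-memory native formats.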


14. Cloud Data Scraping for the Assessment of Outflows from Dammed Rivers in the EU. A Case Study in South Eastern Europe

https://libkey.io/libraries/83/articles/412500104/full-text-file

Skoulikaris, C., and Krestenitis, Y. 2020. “Cloud Data Scraping for the Assessment of Outflows from Dammed Rivers in the EU. A Case Study in South Eastern Europe.” Sustainability 12 (6): 7926–45.

This journal article is from Sustainability. The article focuses on web service-related technologies and tools used to manage, transfer, and process big data and facilitate environmental research. The article begins by introducing the context that, although a plethora of data is currently published on the internet by national and international official sources, its retrieval is sometimes hard to achieve. The researchers then propose, in the applied methodology, the use of web scrapers to integrate data from open cloud-based databanks created for the implementation of two EU Directives, namely the Electricity Market Directive (EMD) and the Water Framework Directive (WFD). The methodology indicates solutions for the direct usage of the derivatives of EU policies in the field of integrated water resources management. The researchers conclude that one of the main outputs of the research is that web scrapers facilitate the retrieval of environmental data that are regularly published on web pages and are not accessible through other data-sharing technologies. The target audience is people who are interested in the application of web scraping tools or who study integrated water resources management. In terms of reading difficulty, the article is moderately difficult for people familiar with web service-related technologies as well as environmental management. The article is relatively biased since it only highlights that web service-related technologies can facilitate environmental research with the advantage of unlimited download potential, while APIs have limited usage policies; meanwhile, it fails to acknowledge the limitations of those technologies.
The article introduces me to new insights into how web scrapers can help the sustainable management of water resources and the quantification of rivers’ outflows to the coastal zone, since web scrapers can be used to retrieve and download dams’ outflows. This article may not be helpful for my Wikipedia article since it is about cloud data scraping, which would be too complicated for my topic of contact scraping.


16. Utilising a Data Capture Tool to Populate a Cardiac Rehabilitation Registry: A Feasibility Study

https://pubmed.ncbi.nlm.nih.gov/30718155/

Thomas, E., Sherry G., Douglas B., et al. 2020. “Utilising a Data Capture Tool to Populate a Cardiac Rehabilitation Registry: A Feasibility Study.” Heart, Lung and Circulation 29 (2): 214–32.

This journal article is from Heart, Lung and Circulation. The article focuses on assessing the feasibility of using a data scraping tool to collect cardiac rehabilitation (CR) minimum variables from electronic hospital administration databases to populate a new CR registry in Australia. The article begins by providing the context that clinical registries are effective for monitoring clinical practice, yet manual data collection can limit their implementation and sustainability. The researchers then introduce a pre-programmed, automated data capture tool named GeneRic Health Network Information for the Enterprise (GRHANITE), which is installed at hospital sites to extract data available in an electronic format. The GRHANITE tool was successfully installed at the two CR sites, and data from 176 patients were securely extracted between September and December 2017. The researchers conclude that the key benefits of a scalable, automated data capture tool like GRHANITE cannot be fully realized in settings with under-developed electronic health infrastructure. The target audience is people who are interested in the application of data scraping or who study clinical registries. In terms of reading difficulty, the article is moderately difficult for people familiar with data scraping and the statistical methods of conducting experiments, while others may find the statistical terms used in the article hard to follow. The article is relatively objective since it acknowledges the GRHANITE tool's limitation that it would not perform well if the electronic health infrastructure is under-developed. The article introduces me to how a data scraping tool can help a clinical registry monitor the quality of CR provided to patients. This article may not be helpful for my Wikipedia article since it is about data scraping tools, which would be too complicated for my topic of contact scraping.


17. Development of an automated climatic data scraping, filtering and display system

https://www-sciencedirect-com.libproxy.berkeley.edu/science/article/pii/S0168169909002348

Yang, Y., Wilson, T., and Wang, J. 2010. “Development of an Automated Climatic Data Scraping, Filtering and Display System.” Computers and Electronics in Agriculture 71 (1): 67–87.

This journal article is from Computers and Electronics in Agriculture. The article focuses on an automated climatic data scraping, filtering, and display system named the Integrated Agricultural Information and Management System (iAIMS). The article begins by providing the context that agriculture faces the challenges of trade restrictions and shrinking profit margins; as a result, agriculturalists need to increase efficiency, and web applications can help avoid the problems related to user-side installation and data distribution. The researchers then introduce iAIMS, which provides a common foundation for developing diversified applications that require dynamic access to data. The climatic data building process is broken down into five major program modules: Data Requester, Data Fetcher, Data Parser, Data Filter, and Data Explorer. The researchers conclude that as the need grows for multi-spatial-scale simulation and analysis of crop production, so will the need for consolidated climatic, soil, and cropland databases that provide transparent and dynamic access to the underlying data. The target audience is people who are interested in data scraping, filtering, and exploring systems or who study climatic databases. In terms of reading difficulty, the first part of the article, which introduces the background of agriculture, is easy to understand; the second part, which describes iAIMS and its application to climatic databases, is harder for people who are not familiar with data processing. The article is relatively biased since it only highlights that iAIMS can help agriculturalists increase efficiency, while it fails to acknowledge the limitations of iAIMS. The article introduces me to an application that helps with cropping systems’ performance and management.
This article may not be helpful for my Wikipedia article since it is about data processing tools, which would be too complicated for my topic of contact scraping.
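The five modules the article names (Data Requester, Fetcher, Parser, Filter, Explorer) form a pipeline that can be mirrored in a short sketch. This is an assumption-laden illustration: the real iAIMS fetches from climatic web sources, while here each stage is stubbed with invented sample data, including a -999 sentinel for a missing reading.

```python
# Hedged sketch of the five-module iAIMS pipeline on invented sample data.

def request_sources():   # Data Requester: decide which sources to pull
    return ["station_1"]

def fetch(source):       # Data Fetcher: retrieve raw text (stubbed here)
    return "2020-06-01,31.5\n2020-06-02,-999\n2020-06-03,29.0"

def parse(raw):          # Data Parser: raw text -> typed (date, temp) records
    return [(d, float(t))
            for d, t in (line.split(",") for line in raw.splitlines())]

def clean(records):      # Data Filter: drop sentinel/missing values
    return [(d, t) for d, t in records if t != -999]

def explore(records):    # Data Explorer: simple summary for display
    temps = [t for _, t in records]
    return {"days": len(records), "mean_temp": sum(temps) / len(temps)}

records = clean(parse(fetch(request_sources()[0])))
print(explore(records))  # {'days': 2, 'mean_temp': 30.25}
```

Keeping each module a separate function is the design point the article makes: stages can be swapped (a new fetcher per data source) without touching the rest of the pipeline.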


18. Ownership and control over publicly accessible platform data.

https://ruor.uottawa.ca/bitstream/10393/39755/1/Scassa-OIR-openaccess.pdf

Scassa, T. 2019. “Ownership and Control over Publicly Accessible Platform Data.” Online Information Review 43 (6): 986-1002.

This journal article is from Online Information Review. The article examines how claims to ‘ownership’ are asserted over publicly accessible platform data and assesses the scope of rights to reuse these data. The article begins by providing the context that a diverse range of users (including researchers, journalists, competing and non-competing businesses) make use of publicly accessible platform data for multiple purposes. Then the researchers discuss the legal tools that platform companies can use to assert control over the publicly accessible data hosted on their sites. These tools can be divided into three categories. The first category involves legal claims based upon ‘ownership’ rights – whether this involves ownership of intellectual or personal property. The second involves limitations on access that may involve contractual terms or technological barriers. The third involves privacy rights. At the end, the researchers conclude that different legislative solutions are available, such as clarifying the scope of protection available to compilations of data, including publicly accessible platform data. The target audience is people who plan to use publicly accessible platform data or are studying the complex ecosystems behind publicly accessible platform data. In terms of reading difficulty, the article has a moderate level of difficulty since it does not have specific terms in computer science or detailed description of statistical methods. The article is relatively objective since it reminds the audience that the combination of barriers erected to data scrapers and the laws that reinforce them can raise ethical issues, such as in the context of journalism. The article introduces me to a publicly accessible data ecosystem which is important to the ownership of publicly accessible data.
This article may not be helpful for my Wikipedia article since it’s about publicly accessible platform data, which is not relevant to my topic of contact scraping.


19. Comparing Price Indices of Clothing and Footwear for Scanner Data and Web Scraped Data

https://www.persee.fr/doc/estat_0336-1454_2019_num_509_1_10891

Chessa, A., and Griffioen, R. 2019. “Comparing Price Indices of Clothing and Footwear for Scanner Data and Web Scraped Data.” Economie et Statistique / Economics and Statistics 509 (1): 49–68.

This journal article is from Economie et Statistique/Economics and Statistics. The article compares price indices based on web scraped and scanner data for clothing and footwear in the same webshop. The article begins by briefly describing the information contained in the scanner data and web scraped data, such as the year and week of sales as well as the item number. Then the researchers describe the method applied to the scanner data and web scraped data of the online shop, the “QU-method” (Quality adjusted Unit value method). Through comparing the calculated price indices, the researchers conclude that scanner data and web scraped prices are often equal, with the latter being slightly higher on average. Numbers of web scraped product prices and products sold show remarkably high correlations. At the end, the researchers suggest that more attention be given to time series analyses and other statistical analyses of scanner data. The target audience is people who are interested in the application of web scraping and scanner data. In terms of reading difficulty, the article has a moderate level of difficulty for people who are familiar with statistical methods for experiments; meanwhile, the article may be difficult for others to understand since it includes specific terms such as frequency distribution. The article is relatively objective since it acknowledges that comparative studies like the one presented in this paper may be difficult to repeat, as the availability of both scanner data and web scraped data from the same retailer is rare. From the article, I learned that scraping an entire website benefits the accuracy of price indices calculated from web scraped data, even though it may be time-consuming. Also, I can choose to sample on specific days instead of every day. This article may not be helpful for my Wikipedia article since it’s about web scraping and scanner data, which is not relevant to my topic of contact scraping.


20. Web Data Extraction, Applications and Techniques: A Survey

https://www.researchgate.net/publication/295256103_Performance_Comparison_of_Web_Data_Extraction_Techniques

Ferrara, E., Meo, P., Fiumara, G., et al. 2012. “Web Data Extraction, Applications and Techniques: A Survey.” Knowledge-Based Systems 1(1): 1-35.

This journal article is from Knowledge-Based Systems. The article focuses on classifying existing approaches to web data extraction in terms of their applications. In the first part of this article, the researchers provide a classification of algorithmic techniques exploited to extract data from web pages. They organize the material by presenting basic techniques first and, subsequently, the main variants to these techniques. The second part of the article is about the applications of web data extraction systems to real-world scenarios. The researchers provide a simple classification framework in which existing applications have been grouped into two main classes (enterprise and social web applications). At the end, the researchers state that the linkage of datasets coming from independent Web platforms fuels novel scientific applications, such as cross-domain recommender systems. They also discuss possible future applications of web data extraction techniques, including bioinformatics and scientific computing, as well as web harvesting. The target audience is people interested in the applications of web data extraction, especially at the enterprise level and at the social web level. In terms of reading difficulty, the article has a higher level of reading difficulty for people who are not familiar with web scraping, since the article contains specific terms such as DOM (Document Object Model) and XPath (XML Path Language). The article is relatively objective since it acknowledges the limitation of the applications of web data extraction systems to real-world scenarios; for example, the data extraction process was not capable of dealing with missing values. From the article, I learned that web data extraction techniques emerge as a key tool to perform data analysis in business process re-engineering; meanwhile, web data extraction techniques offer unprecedented opportunities to analyze human behavior at a very large scale.
This article may not be helpful for my Wikipedia article since it’s about web data extraction tools, which would be too complicated for my topic of contact scraping.

Privacy+ Topic: Contact scraping (2nd draft)[edit]

For broader coverage of this topic, see Web scraping and Data scraping.


(The following is the original Wikipedia article on Contact scraping:

Contact scraping is the practice of obtaining access to a customer's e-mail account in order to retrieve contact information that is then used for marketing purposes.

The New York Times refers to the practices of Tagged, MyLife and desktopdating.net as "contact scraping".

Several commercial packages are available that implement contact scraping for their customers including ViralInviter, TrafficXplode, and TheTsunamiEffect.)


Contact scraping is one of the applications of web scraping; examples of email scraping tools include Uipath, Import.io, and Screen Scraper. [3] Alternative web scraping tools include UzunExt, R functions, and Python's Beautiful Soup. The legal issues of contact scraping fall under the legality of web scraping.

Web Scraping Tools[edit]

The following web scraping tools can be used as alternatives for contact scraping:

  1. UzunExt is an approach to data scraping in which string methods and a crawling process are applied to extract information without using a DOM tree. [4]
  2. The R functions data.rm() and data.rm.a() can be used as a web scraping strategy. [5]
  3. Python's Beautiful Soup library can be used to scrape data and convert it into CSV files. [6]
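The string-method extraction that UzunExt relies on can be sketched in Python. This is only a minimal illustration of pulling text out from between known markers without building a DOM tree, not the published UzunExt implementation; the page snippet and marker choices below are hypothetical:

```python
def extract_between(html, start_marker, end_marker):
    """Collect every substring found between start_marker and end_marker,
    using plain string methods instead of parsing a DOM tree."""
    results = []
    pos = 0
    while True:
        start = html.find(start_marker, pos)
        if start == -1:
            break
        start += len(start_marker)
        end = html.find(end_marker, start)
        if end == -1:
            break
        results.append(html[start:end].strip())
        pos = end + len(end_marker)
    return results

# Hypothetical page fragment listing contact addresses.
page = "<li>alice@example.com</li><li>bob@example.com</li>"
print(extract_between(page, "<li>", "</li>"))
# → ['alice@example.com', 'bob@example.com']
```

Because no tree is built, this style of extraction only needs the two boundary strings to stay stable across pages, which is what makes it faster than DOM-based approaches.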

Legal Issues[edit]

In the United States, the three most common legal claims related to web scraping are compilation copyright infringement, violation of the Computer Fraud and Abuse Act (CFAA), and electronic trespass to chattels. However, these claims have changed doctrinally, and it is uncertain whether they will still exist in the future. [7] For instance, the applicability of the CFAA has been narrowed due to the technical similarities between web scraping and web browsing. [11]

See also[edit]


Peer review[edit]

This is where you will complete your peer review exercise. Please use the following template to fill out your review.


General info

  • Whose work are you reviewing? (provide username) Moonstar0619
  • Link to draft you're reviewing: User:Moonstar0619/sandbox
  • By eddyd101

Lead[edit]

Guiding questions:

  • Has the Lead been updated to reflect the new content added by your peer? Yes it reflects new content.
  • Does the Lead include an introductory sentence that concisely and clearly describes the article's topic? Yes, the lead section includes an introductory sentence that concisely and clearly describes the topic.
  • Does the Lead include a brief description of the article's major sections? It includes brief descriptions of all the major sections except the legal section.
  • Does the Lead include information that is not present in the article? No, the lead does not include information that is not present in the article.
  • Is the Lead concise or is it overly detailed? The Lead is concise.

Lead evaluation[edit]

The lead section is concise but could be made to read more smoothly and address all the major sections of the article.

Content[edit]

Guiding questions:

  • Is the content added relevant to the topic? Yes the content is relevant.
  • Is the content added up-to-date? Yes, the content appears to be up to date.
  • Is there content that is missing or content that does not belong? No
  • Does the article deal with one of Wikipedia's equity gaps? Does it address topics related to historically underrepresented populations or topics? It does not deal with an equity gap or address a topic related to historically underrepresented populations.

Content evaluation[edit]

The content is up to date and relevant, however many of the sections could be expanded.

Tone and Balance[edit]

Guiding questions:

  • Is the content added neutral? The content is neutral.
  • Are there any claims that appear heavily biased toward a particular position? No
  • Are there viewpoints that are overrepresented, or underrepresented? No
  • Does the content added attempt to persuade the reader in favor of one position or away from another? No

Tone and balance evaluation[edit]

The content is unbiased and neutral in its viewpoint. The article does not attempt to persuade the reader. The tone could be improved by editing the Lead section to be smoother.

Sources and References[edit]

Guiding questions:

  • Is all new content backed up by a reliable secondary source of information? Most but not all
  • Are the sources thorough - i.e. Do they reflect the available literature on the topic?
  • Are the sources current? Yes
  • Are the sources written by a diverse spectrum of authors? Do they include historically marginalized individuals where possible? Unsure
  • Check a few links. Do they work? Yes

Sources and references evaluation[edit]

New content is backed up by sources. A few sources are too close to the subject (company websites) and should be changed to secondary sources. The links I spot checked worked.

Organization[edit]

Guiding questions:

  • Is the content added well-written - i.e. Is it concise, clear, and easy to read? Most of the new content is clear and easy to read.
  • Does the content added have any grammatical or spelling errors? There are a few minor errors.
  • Is the content added well-organized - i.e. broken down into sections that reflect the major points of the topic? I think the content is fairly well organized. The "web scraping" and "email scraping" tools sections could be combined.

Organization evaluation[edit]

I found the new content to be easy to understand.

Images and Media[edit]

Guiding questions: If your peer added images or media

  • Does the article include images that enhance understanding of the topic? N/a
  • Are images well-captioned? N/A
  • Do all images adhere to Wikipedia's copyright regulations? N/A
  • Are the images laid out in a visually appealing way? N/A

Images and media evaluation[edit]

No images have been added yet.

For New Articles Only[edit]

If the draft you're reviewing is a new article, consider the following in addition to the above.

  • Does the article meet Wikipedia's Notability requirements - i.e. Is the article supported by 2-3 reliable secondary sources independent of the subject?
  • How exhaustive is the list of sources? Does it accurately represent all available literature on the subject?
  • Does the article follow the patterns of other similar articles - i.e. contain any necessary infoboxes, section headings, and any other features contained within similar articles?
  • Does the article link to other articles so it is more discoverable?

New Article Evaluation[edit]

NA

Overall impressions[edit]

Guiding questions:

  • Has the content added improved the overall quality of the article - i.e. Is the article more complete? Yes, the content enhances the article.
  • What are the strengths of the content added? The legal section provides good insight and the tools section is easy to understand and thorough.
  • How can the content added be improved? I think the Lead section could be improved so that it is smoother and matches the tone of the rest of the article.

Overall evaluation[edit]

The new content makes the article more complete. It is on the whole well written. A few sources need to be updated, some copyediting needs to be done, and the Lead section should be revised.

Peer review(Tinayyt)[edit]

This is where you will complete your peer review exercise. Please use the following template to fill out your review.

General info[edit]

  • Whose work are you reviewing? Moonstar0619
  • Link to draft you're reviewing: Contact scraping

Lead[edit]

Guiding questions:

The lead has been updated with new content, the introductory is short and concise.

  • Has the Lead been updated to reflect the new content added by your peer?
  • Does the Lead include an introductory sentence that concisely and clearly describes the article's topic?
  • Does the Lead include a brief description of the article's major sections?
  • Does the Lead include information that is not present in the article?
  • Is the Lead concise or is it overly detailed?

Lead evaluation[edit]

Content[edit]

Guiding questions:

There are currently two topics, and the content seems to be up to date.

  • Is the content added relevant to the topic?
  • Is the content added up-to-date?
  • Is there content that is missing or content that does not belong?
  • Does the article deal with one of Wikipedia's equity gaps? Does it address topics related to historically underrepresented populations or topics?

Content evaluation[edit]

Tone and Balance[edit]

Guiding questions:

The tone is neutral and does not seem to be biased.

  • Is the content added neutral?
  • Are there any claims that appear heavily biased toward a particular position?
  • Are there viewpoints that are overrepresented, or underrepresented?
  • Does the content added attempt to persuade the reader in favor of one position or away from another?

Tone and balance evaluation[edit]

Sources and References[edit]

Guiding questions:

Content is backed up by references.

  • Is all new content backed up by a reliable secondary source of information?
  • Are the sources thorough - i.e. Do they reflect the available literature on the topic?
  • Are the sources current?
  • Are the sources written by a diverse spectrum of authors? Do they include historically marginalized individuals where possible?
  • Check a few links. Do they work?

Sources and references evaluation[edit]

Organization[edit]

Guiding questions:

The content added seems clear and easy to read.

  • Is the content added well-written - i.e. Is it concise, clear, and easy to read?
  • Does the content added have any grammatical or spelling errors?
  • Is the content added well-organized - i.e. broken down into sections that reflect the major points of the topic?

Organization evaluation[edit]

Images and Media[edit]

Guiding questions: If your peer added images or media

There are no media in the article currently

  • Does the article include images that enhance understanding of the topic?
  • Are images well-captioned?
  • Do all images adhere to Wikipedia's copyright regulations?
  • Are the images laid out in a visually appealing way?

Images and media evaluation[edit]

For New Articles Only[edit]

If the draft you're reviewing is a new article, consider the following in addition to the above.

  • Does the article meet Wikipedia's Notability requirements - i.e. Is the article supported by 2-3 reliable secondary sources independent of the subject?
  • How exhaustive is the list of sources? Does it accurately represent all available literature on the subject?
  • Does the article follow the patterns of other similar articles - i.e. contain any necessary infoboxes, section headings, and any other features contained within similar articles?
  • Does the article link to other articles so it is more discoverable?

New Article Evaluation[edit]

Overall impressions[edit]

Guiding questions:

  • Has the content added improved the overall quality of the article - i.e. Is the article more complete?
  • What are the strengths of the content added?
  • How can the content added be improved?

Overall evaluation[edit]

Copy Edit[edit]

For broader coverage of this topic, see Web scraping and Data scraping.

(The following is the original Wikipedia article on Contact scraping:

Contact scraping is the practice of obtaining access to a customer's e-mail account in order to retrieve contact information that is then used for marketing purposes.

The New York Times refers to the practices of Tagged, MyLife and desktopdating.net as "contact scraping".

Several commercial packages are available that implement contact scraping for their customers including ViralInviter, TrafficXplode, and TheTsunamiEffect.)

Contact scraping is one of the applications of web scraping; examples of email scraping tools include Zoominfo, Skrapp.io, and Hunter.io.

Email Scraping Tools[edit]

  1. Octoparse provides an email scraping platform for three types of industries: social media data, e-commerce & retail data, and news & content curation. It offers four pricing options; for the enterprise plan, users must submit a request to obtain pricing information.
  2. Skrapp.io provides an email scraping platform with features ranging from LinkedIn integration and domain search to a lead directory. It offers five pricing options, and one can choose to pay monthly or annually.
  3. Hunter.io provides an email scraping platform by identifying email patterns via domain names. It offers five pricing tiers with monthly or annual billing.
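The domain-pattern idea described for Hunter.io can be illustrated with a generic sketch: given a person's name and a company domain, candidate addresses are generated from common corporate formats. The pattern list below is hypothetical and this is not Hunter.io's actual algorithm or API:

```python
def candidate_emails(first, last, domain):
    """Generate common corporate address patterns for a name/domain pair.
    The pattern set is illustrative, not an official or exhaustive list."""
    first, last = first.lower(), last.lower()
    patterns = [
        f"{first}.{last}",   # jane.doe
        f"{first}{last}",    # janedoe
        f"{first[0]}{last}", # jdoe
        f"{first}",          # jane
        f"{first}_{last}",   # jane_doe
    ]
    return [f"{p}@{domain}" for p in patterns]

print(candidate_emails("Jane", "Doe", "example.com"))
# → ['jane.doe@example.com', 'janedoe@example.com', 'jdoe@example.com',
#    'jane@example.com', 'jane_doe@example.com']
```

A real service would additionally verify which candidate actually exists, for example by checking it against addresses already observed for that domain.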

Web Scraping Tools[edit]

The following web scraping tools can be used as alternatives for contact scraping:

  1. UzunExt is an approach to data scraping in which string methods and a crawling process are applied to extract information without using a DOM tree.
  2. The R functions data.rm() and data.rm.a() can be used as a web scraping strategy.
  3. Python's Beautiful Soup library can be used to scrape data and convert it into CSV files.
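Item 3's scrape-and-convert step can be sketched as follows; the standard library's html.parser is used here in place of Beautiful Soup so the example has no external dependency, and the page content and column name are hypothetical:

```python
import csv
import io
from html.parser import HTMLParser

class ContactExtractor(HTMLParser):
    """Collect the text of every <li> element on a page."""
    def __init__(self):
        super().__init__()
        self.in_li = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li and data.strip():
            self.rows.append(data.strip())

# Hypothetical page fragment listing contact addresses.
page = "<ul><li>alice@example.com</li><li>bob@example.com</li></ul>"
parser = ContactExtractor()
parser.feed(page)

# Write the scraped values to CSV (an in-memory buffer here; a file in practice).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["email"])
for row in parser.rows:
    writer.writerow([row])
print(buf.getvalue())
```

Beautiful Soup offers a friendlier API for the same parse-then-select step, but the shape of the pipeline (parse the page, select elements, write rows to CSV) is the same.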

Legal Issues[edit]

In the United States, the three most common legal claims are compilation copyright infringement, violation of the Computer Fraud and Abuse Act (CFAA), and electronic trespass to chattels. However, these claims have changed doctrinally, and it is uncertain whether they will still exist in the future. For instance, the applicability of the CFAA has been narrowed due to the technical similarities between web scraping and web browsing.

See also[edit]

General info[edit]

  • Whose work are you reviewing? (provide username) Moonstar0619
  • Link to draft you're reviewing: User:Moonstar0619/sandbox
  • By eddyd101

Lead[edit]

Guiding questions:

  • Has the Lead been updated to reflect the new content added by your peer? Yes it reflects new content.
  • Does the Lead include an introductory sentence that concisely and clearly describes the article's topic? Yes, the lead section includes an introductory sentence that concisely and clearly describes the topic.
  • Does the Lead include a brief description of the article's major sections? It includes brief descriptions of all the major sections except the legal section.
  • Does the Lead include information that is not present in the article? No, the lead does not include information that is not present in the article.
  • Is the Lead concise or is it overly detailed? The Lead is concise.

Lead evaluation[edit]

The lead section is from the original article. It gives the definition of contact scraping. However, it does not briefly describe the new parts that are added to the article.

Content[edit]

Guiding questions:

  • Is the content added relevant to the topic?
  • Is the content added up-to-date?
  • Is there content that is missing or content that does not belong?
  • Does the article deal with one of Wikipedia's equity gaps? Does it address topics related to historically underrepresented populations or topics?

Content evaluation[edit]

The content added is relevant to the topic and up-to-date. There could be more content on the purpose of contact scraping. For the section on Legal Issues, there could be more explanations on how laws are applied to contact scraping.

Tone and Balance[edit]

Guiding questions:

  • Is the content added neutral?
  • Are there any claims that appear heavily biased toward a particular position?
  • Are there viewpoints that are overrepresented, or underrepresented?
  • Does the content added attempt to persuade the reader in favor of one position or away from another?

Tone and balance evaluation[edit]

The content added is neutral and not biased. It does not try to persuade readers.

Sources and References[edit]

Guiding questions:

  • Is all new content backed up by a reliable secondary source of information?
  • Are the sources thorough - i.e. Do they reflect the available literature on the topic?
  • Are the sources current?
  • Are the sources written by a diverse spectrum of authors? Do they include historically marginalized individuals where possible?
  • Check a few links. Do they work?

Sources and references evaluation[edit]

All new content is backed up by a source. Three of the sources are company websites, used to describe the service they offer. Some other sources could be used as supplements. Sources are current and written by different authors. The few links I checked work.

Organization[edit]

Guiding questions:

  • Is the content added well-written - i.e. Is it concise, clear, and easy to read?
  • Does the content added have any grammatical or spelling errors?
  • Is the content added well-organized - i.e. broken down into sections that reflect the major points of the topic?

Organization evaluation[edit]

The new content can be understood easily. However, it could be better organized. For example, "Email Scraping Tools" and "Web Scraping Tools" could probably be two smaller sections inside a larger section.

Images and Media[edit]

Guiding questions: If your peer added images or media

  • Does the article include images that enhance understanding of the topic?
  • Are images well-captioned?
  • Do all images adhere to Wikipedia's copyright regulations?
  • Are the images laid out in a visually appealing way?

Images and media evaluation[edit]

No images have been added yet.

For New Articles Only[edit]

If the draft you're reviewing is a new article, consider the following in addition to the above.

  • Does the article meet Wikipedia's Notability requirements - i.e. Is the article supported by 2-3 reliable secondary sources independent of the subject?
  • How exhaustive is the list of sources? Does it accurately represent all available literature on the subject?
  • Does the article follow the patterns of other similar articles - i.e. contain any necessary infoboxes, section headings, and any other features contained within similar articles?
  • Does the article link to other articles so it is more discoverable?

New Article Evaluation[edit]

NA

Overall impressions[edit]

Guiding questions:

  • Has the content added improved the overall quality of the article - i.e. Is the article more complete?
  • What are the strengths of the content added?
  • How can the content added be improved?

Overall evaluation[edit]

New content definitely makes the article more complete. The new content describes several methods of how contact scraping could be applied. If the new content were better organized into sections, readers could understand the article more easily.

Peer review Draft 3 - SfWarriors99[edit]

This is where you will complete your peer review exercise. Please use the following template to fill out your review.

General info[edit]

Lead[edit]

Guiding questions:

  • Has the Lead been updated to reflect the new content added by your peer?
  • Does the Lead include an introductory sentence that concisely and clearly describes the article's topic?
  • Does the Lead include a brief description of the article's major sections?
  • Does the Lead include information that is not present in the article?
  • Is the Lead concise or is it overly detailed?

Lead evaluation[edit]

The lead is concise, but could be improved by having more information and a broader definition.

Content[edit]

Guiding questions:

  • Is the content added relevant to the topic?
  • Is the content added up-to-date?
  • Is there content that is missing or content that does not belong?
  • Does the article deal with one of Wikipedia's equity gaps? Does it address topics related to historically underrepresented populations or topics?

Content evaluation[edit]

The content is relevant, up to date, but is missing some information.

Tone and Balance[edit]

Guiding questions:

  • Is the content added neutral?
  • Are there any claims that appear heavily biased toward a particular position?
  • Are there viewpoints that are overrepresented, or underrepresented?
  • Does the content added attempt to persuade the reader in favor of one position or away from another?

Tone and balance evaluation[edit]

The tone is neutral, and no viewpoints are over- or underrepresented in the article.

Sources and References[edit]

Guiding questions:

  • Is all new content backed up by a reliable secondary source of information?
  • Are the sources thorough - i.e. Do they reflect the available literature on the topic?
  • Are the sources current?
  • Are the sources written by a diverse spectrum of authors? Do they include historically marginalized individuals where possible?
  • Check a few links. Do they work?

Sources and references evaluation[edit]

The author uses multiple sources, and the links are current.

Organization[edit]

Guiding questions:

  • Is the content added well-written - i.e. Is it concise, clear, and easy to read?
  • Does the content added have any grammatical or spelling errors?
  • Is the content added well-organized - i.e. broken down into sections that reflect the major points of the topic?

Organization evaluation[edit]

The author does a great job of creating an initial outline, but I think adding more information and content to the existing material can improve the organization and overall article.

Images and Media[edit]

Guiding questions: If your peer added images or media

  • Does the article include images that enhance understanding of the topic?
  • Are images well-captioned?
  • Do all images adhere to Wikipedia's copyright regulations?
  • Are the images laid out in a visually appealing way?

Images and media evaluation[edit]

There are no images.

For New Articles Only[edit]

If the draft you're reviewing is a new article, consider the following in addition to the above.

  • Does the article meet Wikipedia's Notability requirements - i.e. Is the article supported by 2-3 reliable secondary sources independent of the subject?
  • How exhaustive is the list of sources? Does it accurately represent all available literature on the subject?
  • Does the article follow the patterns of other similar articles - i.e. contain any necessary infoboxes, section headings, and any other features contained within similar articles?
  • Does the article link to other articles so it is more discoverable?

New Article Evaluation[edit]

N/A

Overall impressions[edit]

Guiding questions:

  • Has the content added improved the overall quality of the article - i.e. Is the article more complete?
  • What are the strengths of the content added?
  • How can the content added be improved?

Overall evaluation[edit]

Overall, the article is trending toward being strong. It has a good outline, but I think more content and information needs to be added.

Peer Review- IntheHeartofTexas[edit]

Peer review[edit]

This is where you will complete your peer review exercise. Please use the following template to fill out your review.

General info[edit]

Lead[edit]

Guiding questions:

  • Has the Lead been updated to reflect the new content added by your peer? Yes
  • Does the Lead include an introductory sentence that concisely and clearly describes the article's topic? It is a bit distracting; perhaps it should be in paragraph form.
  • Does the Lead include a brief description of the article's major sections? No
  • Does the Lead include information that is not present in the article? Commercial packages?
  • Is the Lead concise or is it overly detailed? It's not overly detailed, but it's not cohesive.

Lead evaluation[edit]

Perhaps you could make the lead more cohesive and more of a paragraph form.

Content[edit]

Guiding questions:

  • Is the content added relevant to the topic?Yes
  • Is the content added up-to-date? Yes
  • Is there content that is missing or content that does not belong? Perhaps more about how the commercial packages contribute.
  • Does the article deal with one of Wikipedia's equity gaps? Does it address topics related to historically underrepresented populations or topics? No

Content evaluation[edit]

The content is good. Perhaps add more about how contact scraping affects individuals.

Tone and Balance[edit]

Guiding questions:

  • Is the content added neutral? Yes
  • Are there any claims that appear heavily biased toward a particular position? no
  • Are there viewpoints that are overrepresented, or underrepresented? no
  • Does the content added attempt to persuade the reader in favor of one position or away from another? no

Tone and balance evaluation[edit]

The tone is objective and neutral.

Sources and References[edit]

Guiding questions:

  • Is all new content backed up by a reliable secondary source of information? Yes
  • Are the sources thorough - i.e. Do they reflect the available literature on the topic? Yes
  • Are the sources current? Yes
  • Are the sources written by a diverse spectrum of authors? Do they include historically marginalized individuals where possible? Relatively
  • Check a few links. Do they work? Some links are included

Sources and references evaluation[edit]

The sources were good. Perhaps just redo some where the links weren't included.

Organization[edit]

Guiding questions:

  • Is the content added well-written - i.e. Is it concise, clear, and easy to read? Yes
  • Does the content added have any grammatical or spelling errors? None discovered
  • Is the content added well-organized - i.e. broken down into sections that reflect the major points of the topic? No

Organization evaluation[edit]

Perhaps put the lead more in paragraph form

Images and Media[edit]

Guiding questions: If your peer added images or media

  • Does the article include images that enhance understanding of the topic? no
  • Are images well-captioned? n/a
  • Do all images adhere to Wikipedia's copyright regulations? n/a
  • Are the images laid out in a visually appealing way? n/a

Images and media evaluation[edit]

n/a

For New Articles Only[edit]

If the draft you're reviewing is a new article, consider the following in addition to the above.

  • Does the article meet Wikipedia's Notability requirements - i.e. Is the article supported by 2-3 reliable secondary sources independent of the subject? yes
  • How exhaustive is the list of sources? Does it accurately represent all available literature on the subject? Not very exhaustive; in progress
  • Does the article follow the patterns of other similar articles - i.e. contain any necessary infoboxes, section headings, and any other features contained within similar articles? yes
  • Does the article link to other articles so it is more discoverable? yes

New Article Evaluation[edit]

Overall impressions[edit]

Guiding questions:

  • Has the content added improved the overall quality of the article - i.e. Is the article more complete? yes
  • What are the strengths of the content added? The content, tone, and relevant info tagged.
  • How can the content added be improved? Expand on how it affects individuals maybe

Overall evaluation[edit]

This article was well written and the content was good. Perhaps expand on the commercial packages and how your topic affects individuals.

Copy Edit[edit]

For broader coverage of this topic, see Web scraping and Data scraping.

(The following is the original Wikipedia article on Contact scraping:

Contact scraping is the practice of obtaining access to a customer's e-mail account in order to retrieve contact information that is then used for marketing purposes.

The New York Times refers to the practices of Tagged, MyLife and desktopdating.net as "contact scraping".

Several commercial packages are available that implement contact scraping for their customers including ViralInviter, TrafficXplode, and TheTsunamiEffect.)

Contact scraping is one of the applications of web scraping; examples of email scraping tools include Zoominfo, Skrapp.io, and Hunter.io.

Email Scraping Tools[edit]

  1. Octoparse provides an email scraping platform for three types of industries: social media data, e-commerce & retail data, and news & content curation. It offers four pricing options; for the "enterprise plan", users must submit a request to obtain pricing information.
  2. Skrapp.io provides email scraping features ranging from LinkedIn integration and domain search to a lead directory. It offers five pricing options, with monthly or annual payment.
  3. Hunter.io provides an email scraping platform by identifying email patterns via domain names. It offers five pricing tiers with monthly or annual billing options.
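The pattern-based technique described for Hunter.io can be sketched in a few lines of Python. This is a general illustration of the idea (infer a local-part template from known addresses, then apply it to a new name), not Hunter.io's actual algorithm; all names, addresses, and the domain below are made up.

```python
# Observed (first name, last name, address) samples for one domain.
# All values below are hypothetical, for illustration only.
known = [
    ("Jane", "Doe", "jane.doe@example.com"),
    ("John", "Smith", "john.smith@example.com"),
]

def infer_pattern(first: str, last: str, email: str) -> str:
    """Derive a template such as '{f}.{l}' from one observed address."""
    local = email.split("@", 1)[0].lower()
    return local.replace(first.lower(), "{f}").replace(last.lower(), "{l}")

# If every sample yields the same template, use it to guess a new address.
patterns = {infer_pattern(f, l, e) for f, l, e in known}
template = patterns.pop() if len(patterns) == 1 else None
if template:
    guess = template.replace("{f}", "ada").replace("{l}", "lovelace") + "@example.com"
```

Here both samples reduce to the template "{f}.{l}", so the sketch would guess "ada.lovelace@example.com" for a new contact; real services combine many more samples and pattern shapes.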

Web Scraping Tools[edit]

The following web scraping tools can be used as alternatives for contact scraping:

  1. UzunExt is a data scraping approach in which string methods and a crawling process are applied to extract information without building a DOM tree.
  2. The R functions data.rm() and data.rm.a() can be used as a web scraping strategy.
  3. The Python Beautiful Soup library can be used to scrape data and convert it into CSV files.
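As a minimal sketch of item 3, Beautiful Soup (the bs4 package) can parse a page, collect e-mail addresses from its text and mailto: links, and write them to a CSV file. The HTML snippet, the addresses, and the output file name below are hypothetical; a real scraper would first fetch the page over HTTP.

```python
import csv
import re
from bs4 import BeautifulSoup

# Hypothetical page content; a real scraper would fetch this over HTTP.
html = """
<html><body>
  <p>Contact sales at sales@example.com or support@example.com.</p>
  <a href="mailto:info@example.com">Email us</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect addresses from visible text (simple e-mail regex) and mailto: links.
emails = set(re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", soup.get_text()))
for link in soup.find_all("a", href=True):
    if link["href"].startswith("mailto:"):
        emails.add(link["href"][len("mailto:"):])

# Write the harvested addresses to a CSV file.
with open("emails.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["email"])
    for address in sorted(emails):
        writer.writerow([address])
```

The regex is only a rough approximation of e-mail syntax; it is enough to show how contact details can be lifted from a page without any DOM-aware tooling beyond the parser itself.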

Legal Issues[edit]

In the United States, there are three common legal claims related to web scraping: compilation copyright infringement, violation of the Computer Fraud and Abuse Act (CFAA), and electronic trespass to chattels. However, these claims have changed doctrinally, and it is uncertain whether they will persist in the future. For instance, the applicability of the CFAA has been narrowed because of the technical similarities between web scraping and web browsing.

See also[edit]

Peer review (imakespaghetti29)[edit]

General info[edit]

  • Whose work are you reviewing? Moonstar0619
  • Link to draft you're reviewing: Link

Lead[edit]

Guiding questions:

  • Has the Lead been updated to reflect the new content added by your peer?
  • Does the Lead include an introductory sentence that concisely and clearly describes the article's topic?
  • Does the Lead include a brief description of the article's major sections?
  • Does the Lead include information that is not present in the article?
  • Is the Lead concise or is it overly detailed?

Lead evaluation: The Lead does include an introductory sentence that concisely and clearly describes the article's topic. The Lead does include a brief description of the article's major sections. The Lead does not include information that is not present in the article. The Lead is concise and not overly detailed.

Content[edit]

Guiding questions:

  • Is the content added relevant to the topic?
  • Is the content added up-to-date?
  • Is there content that is missing or content that does not belong?
  • Does the article deal with one of Wikipedia's equity gaps? Does it address topics related to historically underrepresented populations or topics?

Content evaluation: The content added is relevant to the topic and up-to-date. There is no content that doesn't belong, but the author can definitely go more into detail on some topics. The article does not deal with any of Wikipedia's equity gaps, nor does it address topics related to historically underrepresented populations or topics.

Tone and Balance[edit]

Guiding questions:

  • Is the content added neutral?
  • Are there any claims that appear heavily biased toward a particular position?
  • Are there viewpoints that are overrepresented, or underrepresented?
  • Does the content added attempt to persuade the reader in favor of one position or away from another?

Tone and balance evaluation: The content added is neutral and there are no claims that appear heavily biased toward a particular position. No viewpoints are under or overrepresented, and the content does not attempt to persuade the reader in favor of one position or away from another.

Sources and References[edit]

Guiding questions:

  • Is all new content backed up by a reliable secondary source of information?
  • Are the sources thorough - i.e. Do they reflect the available literature on the topic?
  • Are the sources current?
  • Are the sources written by a diverse spectrum of authors? Do they include historically marginalized individuals where possible?
  • Check a few links. Do they work?

Sources and references evaluation: Most of the content added is backed up by a reliable secondary source of information. The sources are thorough, but more sources need to be added to reflect the available literature on the topic. The sources are current and up-to-date. There aren't enough sources added yet to judge if they've been written by a diverse spectrum of authors or if they include historically marginalized individuals where possible. I checked a few links, and they work.

Organization[edit]

Guiding questions:

  • Is the content added well-written - i.e. Is it concise, clear, and easy to read?
  • Does the content added have any grammatical or spelling errors?
  • Is the content added well-organized - i.e. broken down into sections that reflect the major points of the topic?

Organization evaluation: The content added is well-written, concise, clear and easy to read. There are no grammatical or spelling errors. The content added is well-organized and broken down into appropriate sections that reflect the major points of the topic.

Images and Media[edit]

Guiding questions: If your peer added images or media

  • Does the article include images that enhance understanding of the topic?
  • Are images well-captioned?
  • Do all images adhere to Wikipedia's copyright regulations?
  • Are the images laid out in a visually appealing way?

Images and media evaluation: The author hasn't added any images yet.

For New Articles Only[edit]

If the draft you're reviewing is a new article, consider the following in addition to the above.

  • Does the article meet Wikipedia's Notability requirements - i.e. Is the article supported by 2-3 reliable secondary sources independent of the subject?
  • How exhaustive is the list of sources? Does it accurately represent all available literature on the subject?
  • Does the article follow the patterns of other similar articles - i.e. contain any necessary infoboxes, section headings, and any other features contained within similar articles?
  • Does the article link to other articles so it is more discoverable?

New Article Evaluation: This article is not a new article.

Overall impressions[edit]

Guiding questions:

  • Has the content added improved the overall quality of the article - i.e. Is the article more complete?
  • What are the strengths of the content added?
  • How can the content added be improved?

Overall evaluation: The content added will definitely improve the overall quality of the article and make the article more complete. The strengths of the content added are reliability (backed up by peer-reviewed academic journal sources) and visibility (links to many other articles). The content added can be improved by elaborating further and going more into detail into the ideas introduced. All the best!


Peer review by Niangao[edit]

This is where you will complete your peer review exercise. Please use the following template to fill out your review.

General info[edit]

  • Whose work are you reviewing? (provide username) Moonstar
  • Link to draft you're reviewing:

Lead[edit]

Guiding questions:

  • Has the Lead been updated to reflect the new content added by your peer? Yes
  • Does the Lead include an introductory sentence that concisely and clearly describes the article's topic? Yes
  • Does the Lead include a brief description of the article's major sections? Yes
  • Does the Lead include information that is not present in the article? No
  • Is the Lead concise or is it overly detailed? Concise
  • Lead evaluation

Content[edit]

Guiding questions:

  • Is the content added relevant to the topic? Yes
  • Is the content added up-to-date? Yes
  • Is there content that is missing or content that does not belong? No
  • Does the article deal with one of Wikipedia's equity gaps? Does it address topics related to historically underrepresented populations or topics? No

Content evaluation[edit]

Tone and Balance[edit]

Guiding questions:

  • Is the content added neutral? Yes
  • Are there any claims that appear heavily biased toward a particular position? No
  • Are there viewpoints that are overrepresented, or underrepresented? No
  • Does the content added attempt to persuade the reader in favor of one position or away from another? No

Tone and balance evaluation[edit]

Sources and References[edit]

Guiding questions:

  • Is all new content backed up by a reliable secondary source of information? Yes
  • Are the sources thorough - i.e. Do they reflect the available literature on the topic? Yes
  • Are the sources current? Yes
  • Are the sources written by a diverse spectrum of authors? Do they include historically marginalized individuals where possible? Yes
  • Check a few links. Do they work? Yes

Sources and references evaluation[edit]

Organization[edit]

Guiding questions:

  • Is the content added well-written - i.e. Is it concise, clear, and easy to read? Concise and easy to read
  • Does the content added have any grammatical or spelling errors? No
  • Is the content added well-organized - i.e. broken down into sections that reflect the major points of the topic? Yes

Organization evaluation[edit]

Images and Media[edit]

Guiding questions: If your peer added images or media

  • Does the article include images that enhance understanding of the topic? No
  • Are images well-captioned? No
  • Do all images adhere to Wikipedia's copyright regulations? No
  • Are the images laid out in a visually appealing way? No

Images and media evaluation[edit]

For New Articles Only[edit]

If the draft you're reviewing is a new article, consider the following in addition to the above.

  • Does the article meet Wikipedia's Notability requirements - i.e. Is the article supported by 2-3 reliable secondary sources independent of the subject? Yes
  • How exhaustive is the list of sources? Does it accurately represent all available literature on the subject? Yes
  • Does the article follow the patterns of other similar articles - i.e. contain any necessary infoboxes, section headings, and any other features contained within similar articles? Yes
  • Does the article link to other articles so it is more discoverable? Yes

New Article Evaluation[edit]

Overall impressions[edit]

Guiding questions:

  • Has the content added improved the overall quality of the article - i.e. Is the article more complete? Yes
  • What are the strengths of the content added?
  • How can the content added be improved?

Overall evaluation[edit]

Overall, this is a great article. It is clearly structured and well-developed in its different aspects, and the tone is neutral. The only suggestion would be to add some pictures to aid the understanding of your article.

Review (Leadership team)[edit]

Hi Moonstar0619, your draft looks pretty good. (I reviewed the draft at the top of this sandbox page, please let me know if you have a more updated article :)). I think the presented information is pretty clear and I like your "see also" section which introduces more relevant topics to your readers. I notice you also have multiple hyperlinks and in-text citations in your article, which is really nice. In general, I think this is a pretty good draft. Here are some specific suggestions:

  • I didn't see the reference section in the draft in your sandbox; please add that section if you don't have it, and make sure that you use 20+ peer-reviewed articles as sources.
  • It may also help to include the source (which article...) for the sentence "The New York Times refers to the practices of Tagged, MyLife and desktopdating.net as "contact scraping"."
  • Also, it may help organize the page if the structure of the article is clearer. For example, you could have "Web Scraping Tools" and "Legal Issues" as separate sections. You could also refer to the Internet Privacy page for the structure of the article.
  • I think the draft already contains a lot of information on the topic, but I personally think you could add a few more sections, such as "Examples" and "Defense", to further introduce the topic to your readers. I didn't read all the peer-reviewed articles as you did, so I'm not that familiar with the sub-topics on this issue. If you would like to further improve your article, you can also refer to similar webpages, such as Web scraping.
  • I like your "legal issues" section and I'm glad that you brought this issue up. One suggestion would be to include more cases from other countries in this section, since this topic is not limited to contact scraping in the U.S.; including cases from other regions and countries could give it a more comprehensive review.

In general, I think you have a really nice draft on the topic with some detailed information. Good job! Good luck on your final symposium and final article!

  1. ^ Typing In an E-Mail Address, and Giving Up Your Friends’ as Well
  2. ^ 'Viral inviters' want your e-mail contact list
  3. ^ a b "Web Scraping", SpringerReference, Berlin/Heidelberg: Springer-Verlag, retrieved 2020-11-03
  4. ^ a b Uzun, E. (2020). "A Novel Web Scraping Approach Using the Additional Information Obtained From Web Pages". IEEE Access. 8: 61726–61740. doi:10.1109/ACCESS.2020.2984503. ISSN 2169-3536.
  5. ^ a b Vallone, A.; Coro, C.; Beatriz, S. (2020). "Strategies to access web-enabled urban spatial data for socioeconomic research using R functions". Journal of Geographical Systems: Spatial Theory, Models, Methods, and Data. 22(2): 217–34.
  6. ^ a b Vela, Belen; Cavero, Jose Maria; Caceres, Paloma; Cuesta, Carlos E. (2019). "A Semi-Automatic Data–Scraping Method for the Public Transport Domain". IEEE Access. 7: 105627–105637. doi:10.1109/access.2019.2932197. ISSN 2169-3536.
  7. ^ a b c Hirschey, Jeffrey (2014). "Symbiotic Relationships: Pragmatic Acceptance of Data Scraping". SSRN Electronic Journal. doi:10.2139/ssrn.2419167. ISSN 1556-5068.
  8. ^ "Internet Law, Ch. 06: Trespass to Chattels". www.tomwbell.com. Retrieved 2020-11-12.
  9. ^ Beckham, J. Brian (2003). "Intel v. Hamidi: Spam as a Trespass to Chattels - Deconstruction of a Private Right of Action in California". The John Marshall Journal of Information Technology & Privacy Law. 22: 205–228.
  10. ^ "FAQ about linking – Are website terms of use binding contracts?". www.chillingeffects.org. 2007-08-20. Archived from the original on 2002-03-08. Retrieved 2007-08-20.
  11. ^ a b Christensen, J. (2020). "The Demise of the Cfaa in Data Scraping Cases". Notre Dame Journal of Law, Ethics & Public Policy. 34(2): 529–47.
  12. ^ "Controversy Surrounds 'Screen Scrapers': Software Helps Users Access Web Sites But Activity by Competitors Comes Under Scrutiny". Findlaw. Retrieved 2020-11-12.
  13. ^ Philip H. Liu, Mark Edward Davis (2015–16). "Web Scraping - Limits on Free Samples". Landslide. 8.
  14. ^ Tomáš Pikulíka, Peter Štarchoň (2020). "Public registers with personal data under scrutiny of DPA regulators". Procedia Computer Science. 170: 1174–1179.
  15. ^ Oxford Analytica (2019). "Europe's national regulators hold key to GDPR success". Expert Briefings.
  16. ^ Infrastructure. "Spam Act 2003". www.legislation.gov.au. Retrieved 2020-12-01.
  17. ^ Torresan, Danielle (2013). "Keeping Good Companies". informit. 65: 668–669.
  18. ^ "Unauthorised photographs on the internet — back on the Attorney-General's agenda". Internet Law Bulletin. 8. 2005.
  19. ^ Lee, Jyh-An (2018). "Hacking into China's Cybersecurity Law" (PDF). Wake Forest Law Review. 53: 57–104.
  20. ^ Li Qian, Jiang Tao (2020). "Rethinking Criminal Sanctions on Data Scraping in China Based on a Case Study of Illegally Obtaining Specific Data by Crawlers". China Legal Science. 8: 136.