DocFetcher – Fast Document Search

DocFetcher is an Open Source desktop search application: It allows you to search the contents of files on your computer; you can think of it as Google for your local files. The application runs on Windows, Linux and OS X, and is made available under the Eclipse Public License.

The screenshot below shows the main user interface. Queries are entered in the text field at (1). The search results are displayed in the result pane at (2). The preview pane at (3) shows a text-only preview of the file currently selected in the result pane. All matches in the file are highlighted in yellow.

You can filter the results by minimum and/or maximum filesize (4), by file type (5) and by location (6). The buttons at (7) are used for opening the manual, opening the preferences and minimizing the program into the system tray, respectively.

DocFetcher requires that you create so-called indexes for the folders you want to search in. What indexing is and how it works is explained in more detail below. In a nutshell, an index allows DocFetcher to find out very quickly (in the order of milliseconds) which files contain a particular set of words, thereby vastly speeding up searches. The following screenshot shows DocFetcher’s dialog for creating new indexes:

Clicking on the “Run” button on the bottom right of this dialog starts the indexing. The indexing process can take a while, depending on the number and sizes of the files to be indexed. A good rule of thumb is 200 files per minute.

While creating an index takes time, it has to be done only once per folder. Also, updating an index after the folder’s contents have changed is much faster than creating it — it usually takes only a couple of seconds.

  • A portable version: There is a portable version of DocFetcher that runs on Windows, Linux and OS X. How this is useful is described in more detail further down this page.
  • 64-bit support: Both 32-bit and 64-bit operating systems are supported.
  • Unicode support: DocFetcher comes with rock-solid Unicode support for all major formats, including Microsoft Office, OpenOffice.org, PDF, HTML, RTF and plain text files.
  • Archive support: DocFetcher supports the following archive formats: zip, 7z, rar, and the whole tar.* family. The file extensions for zip archives can be customized, allowing you to add more zip-based archive formats as needed. Also, DocFetcher can handle an unlimited nesting of archives (e.g. a zip archive containing a 7z archive containing a rar archive… and so on).
  • Search in source code files: The file extensions by which DocFetcher recognizes plain text files can be customized, so you can use DocFetcher for searching in any kind of source code and other text-based file formats. (This works quite well in combination with the customizable zip extensions, e.g. for searching in Java source code inside Jar files.)
  • Outlook PST files: DocFetcher allows searching for Outlook emails, which Microsoft Outlook typically stores in PST files.
  • Detection of HTML pairs: By default, DocFetcher detects pairs of HTML files (e.g. a file named “foo.html” and a folder named “foo_files”), and treats the pair as a single document. This feature may seem rather useless at first, but it turned out that this dramatically increases the quality of the search results when you’re dealing with HTML files, since all the “clutter” inside the HTML folders disappears from the results.
  • Regex-based exclusion of files from indexing: You can use regular expressions to exclude certain files from indexing. For example, to exclude Microsoft Excel files, you can use a regular expression like this: .*\.xls (a short illustration of this pattern follows the list of file formats below).
  • Mime-type detection: You can use regular expressions to turn on “mime-type detection” for certain files, meaning that DocFetcher will try to detect their actual file types not just by looking at the filename, but also by peeking into the file contents. This comes in handy for files that have the wrong file extension.
  • Powerful query syntax: In addition to basic constructs like OR, AND and NOT, DocFetcher also supports, among other things: wildcards, phrase search, fuzzy search (“find words that are similar to…”), proximity search (“these two words should be at most 10 words away from each other”) and boosting (“increase the score of documents containing…”).
  • Microsoft Office (doc, xls, ppt)
  • Microsoft Office 2007 and newer (docx, xlsx, pptx, docm, xlsm, pptm)
  • Microsoft Outlook (pst)
  • OpenOffice.org (odt, ods, odg, odp, ott, ots, otg, otp)
  • Portable Document Format (pdf)
  • EPUB (epub)
  • HTML (html, xhtml, …)
  • TXT and other plain text formats (customizable)
  • Rich Text Format (rtf)
  • AbiWord (abw, abw.gz, zabw)
  • Microsoft Compiled HTML Help (chm)
  • MP3 Metadata (mp3)
  • FLAC Metadata (flac)
  • JPEG Exif Metadata (jpg, jpeg)
  • Microsoft Visio (vsd)
  • Scalable Vector Graphics (svg)
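
As a quick illustration of the regex-based exclusion mentioned above: the sketch below is plain Python, not DocFetcher's own Java code, and simply shows which file names a pattern such as .*\.xls would match, assuming the pattern has to match the whole file name.

    import re

    # Illustration only; DocFetcher applies such patterns internally.
    exclude = re.compile(r".*\.xls")

    for name in ["report.xls", "report.xlsx", "notes.txt"]:
        status = "excluded" if exclude.fullmatch(name) else "indexed"
        print(f"{name}: {status}")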

In comparison to other desktop search applications, here’s where DocFetcher stands out:

Crap-free: We strive to keep DocFetcher’s user interface clutter- and crap-free. No advertisement or “would you like to register…?” popups. No useless stuff is installed in your web browser, registry or anywhere else in your system.

Privacy: DocFetcher does not collect your private data. Ever. Anyone in doubt about this can check the publicly accessible source code.

Free forever: Since DocFetcher is Open Source, you don’t have to worry about the program ever becoming obsolete and unsupported, because the source code will always be there for the taking. Speaking of support, have you gotten the news that Google Desktop, one of DocFetcher’s major commercial competitors, was discontinued in 2011? Well…

Cross-platform: Unlike many of its competitors, DocFetcher does not only run on Windows, but also on Linux and OS X. Thus, if you ever feel like moving away from your Windows box and on to Linux or OS X, DocFetcher will be waiting for you on the other side.

Portable: One of DocFetcher’s greatest strengths is its portability. Basically, with DocFetcher you can build up a complete, fully searchable document repository, and carry it around on your USB drive. More on that in the next section.

Indexing only what you need: Among DocFetcher’s commercial competitors, there seems to be a tendency to nudge users towards indexing the entire hard drive — perhaps in an attempt to take away as many decisions as possible from supposedly “dumb” users, or worse, in an attempt to harvest more user data. In practice though, it seems safe to assume that most people don’t want to have their entire hard drive indexed: Not only is this a waste of indexing time and disk space, but it also clutters the search results with unwanted files. Hence, DocFetcher indexes only the folders you explicitly want to be indexed, and on top of that you’re provided with a multitude of filtering options.

One of DocFetcher’s outstanding features is that it is available as a portable version which allows you to create a portable document repository — a fully indexed and fully searchable repository of all your important documents that you can freely move around.

Usage examples: There are all kinds of things you can do with such a repository: You can carry it with you on a USB drive, burn it onto a CD-ROM for archiving purposes, put it in an encrypted volume (recommended: TrueCrypt), synchronize it between multiple computers via a cloud storage service like DropBox, etc. Better yet, since DocFetcher is Open Source, you can even redistribute your repository: Upload it and share it with the rest of the world if you want.

Java: Performance and portability: One aspect some people might take issue with is that DocFetcher was written in Java, which has a reputation of being “slow”. This was indeed true ten years ago, but since then Java’s performance has seen much improvement, according to Wikipedia. In any case, the great thing about being written in Java is that the very same portable DocFetcher package can be run on Windows, Linux and OS X — many other programs require separate bundles for each platform. As a result, you can, for example, put your portable document repository on a USB drive and then access it from any of these operating systems, provided that a Java runtime is installed.

This section tries to give a basic understanding of what indexing is and how it works.

The naive approach to file search: The most basic approach to file search is to simply visit every file in a certain location one-by-one whenever a search is performed. This works well enough for filename-only search, because analyzing filenames is very fast. However, it wouldn’t work so well if you wanted to search the contents of files, since full text extraction is a much more expensive operation than filename analysis.

Index-based search: That’s why DocFetcher, being a content searcher, takes an approach known as indexing: The basic idea is that most of the files people need to search in (like, more than 95%) are modified very infrequently or not at all. So, rather than doing full text extraction on every file on every search, it is far more efficient to perform text extraction on all files just once, and to create a so-called index from all the extracted text. This index is kind of like a dictionary that allows quickly looking up files by the words they contain.
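
To make the dictionary analogy concrete, here is a toy Python sketch of the idea behind such an index (DocFetcher itself builds its indexes with Apache Lucene; the file names and text below are made up): each word maps to the set of files containing it, so a search becomes a lookup instead of a scan of every file.

    from collections import defaultdict

    # Toy corpus: file name -> extracted text (made-up examples).
    docs = {
        "notes.txt": "project meeting on friday",
        "todo.txt": "prepare slides for the meeting",
    }

    # Build the "index": word -> set of files containing that word.
    index = defaultdict(set)
    for filename, text in docs.items():
        for word in text.lower().split():
            index[word].add(filename)

    # A search is now a dictionary lookup, not a scan of every file.
    print(sorted(index["meeting"]))   # ['notes.txt', 'todo.txt']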

Telephone book analogy: As an analogy, consider how much more efficient it is to look up someone’s phone number in a telephone book (the “index”) instead of calling every possible phone number just to find out whether the person on the other end is the one you’re looking for. Calling someone over the phone and extracting text from a file can both be considered “expensive operations”. Also, the fact that people don’t change their phone numbers very frequently is analogous to the fact that most files on a computer are rarely if ever modified.

Index updates: Of course, an index only reflects the state of the indexed files when it was created, not necessarily the latest state of the files. Thus, if the index isn’t kept up-to-date, you could get outdated search results, much in the same way a telephone book can become out of date. However, this shouldn’t be much of a problem if we can assume that most of the files are rarely modified. Additionally, DocFetcher is capable of automatically updating its indexes: (1) When it’s running, it detects changed files and updates its indexes accordingly. (2) When it isn’t running, a small daemon in the background will detect changes and keep a list of indexes to be updated; DocFetcher will then update those indexes the next time it is started. And don’t you worry about the daemon: It has really low CPU usage and memory footprint, since it does nothing except noting which folders have changed, and leaves the more expensive index updates to DocFetcher.
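
The folder-watching idea described above can be illustrated with a small sketch. This is not DocFetcher's actual daemon; it is only an example of the pattern, written with the third-party Python package watchdog (pip install watchdog), and the watched path is a placeholder. The point is that noting which folders changed is cheap, while the expensive re-indexing is deferred.

    import os.path
    import time
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler

    changed_folders = set()

    class NoteChanges(FileSystemEventHandler):
        def on_any_event(self, event):
            # Only note the affected folder; the expensive re-indexing happens later.
            folder = event.src_path if event.is_directory else os.path.dirname(event.src_path)
            changed_folders.add(folder)

    observer = Observer()
    observer.schedule(NoteChanges(), path="/path/to/indexed/folder", recursive=True)  # placeholder path
    observer.start()
    try:
        time.sleep(60)  # watch for a minute, then report
    finally:
        observer.stop()
        observer.join()

    print("Folders to re-index on next start:", sorted(changed_folders))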

Source

10 Open Source Web Crawlers: Best List

As you are searching for the best open source web crawlers, you surely know they are a great source of data for analysis and data mining.

Internet crawling tools are also called web spiders, web data extraction software, and website scraping tools.

The majority of them are written in Java, but there is a good list of free and open code data extracting solutions in C#, C, Python, PHP, and Ruby. You can download them on Windows, Linux, Mac or Android.

Web content scraping applications can benefit your business in many ways. They collect content from different public websites and deliver the data in a manageable format. They help you monitor news, social media, images, articles, your competitors, and more.

On this page:

  • 10 of the best open source web crawlers.
  • How to choose open source web scraping software? (with an Infographic in PDF)

1. Scrapy

Scrapy is an open source and collaborative framework for data extracting from websites. It is a fast, simple but extensible tool written in Python. Scrapy runs on Linux, Windows, Mac, and BSD.

It extracts structured data that you can use for many purposes and applications, such as data mining, information processing or historical archival.

Scrapy was originally designed for web scraping. However, it is also used to extract data using APIs or as a web crawler for general purposes.

Key features and benefits:

  • Built-in support for extracting data from HTML/XML sources using extended CSS selectors and XPath expressions.
  • Generating feed exports in multiple formats (JSON, CSV, XML).
  • Built on Twisted
  • Robust encoding support and auto-detection.
  • Fast and simple.
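
To give a feel for how Scrapy is used, here is a minimal spider sketch. It targets quotes.toscrape.com, the demo site used in Scrapy's own tutorial, and can be run by saving it as, say, quotes_spider.py and invoking `scrapy runspider quotes_spider.py -o quotes.json`; treat it as an illustration rather than a production crawler.

    import scrapy

    class QuotesSpider(scrapy.Spider):
        """Minimal spider: fetch pages, yield structured items, follow pagination."""
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # Extract structured data with CSS selectors.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow the "next page" link, if there is one.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)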

2. Heritrix

Heritrix is one of the most popular free and open-source web crawlers in Java. Actually, it is an extensible, web-scale, archival-quality web scraping project.

Heritrix is a very scalable and fast solution. You can crawl/archive a set of websites in no time. In addition, it is designed to respect the robots.txt exclusion directives and META robots tags.

Runs on Linux/Unix-like systems and Windows.

Key features and benefits:

  • HTTP authentication
  • NTLM Authentication
  • XSL Transformation for link extraction
  • Search engine independence
  • Mature and stable platform
  • Highly configurable
  • Runs from any machine

3. WebSphinix

WebSphinix is a great, easy-to-use, personal and customizable web crawler. It is designed for advanced web users and Java programmers, allowing them to crawl over a small part of the web automatically.

This web data extraction solution is also a comprehensive Java class library and interactive development environment. WebSphinix includes two parts: the Crawler Workbench and the WebSPHINX class library.

The Crawler Workbench is a good graphical user interface that allows you to configure and control a customizable web crawler. The library provides support for writing web crawlers in Java.

WebSphinix runs on Windows, Linux, Mac OS, and Android.

Key features and benefits:

  • Visualize a collection of web pages as a graph
  • Concatenate pages together for viewing or printing them as a single document
  • Extract all text matching a certain pattern.
  • Tolerant HTML parsing
  • Support for the robot exclusion standard
  • Common HTML transformations
  • Multithreaded Web page retrieval

4. Apache Nutch

When it comes to the best open source web crawlers, Apache Nutch definitely has a top place on the list. Apache Nutch is a highly extensible and scalable open source web data extraction project that is great for data mining.

Nutch can run on a single machine, but much of its strength comes from running in a Hadoop cluster.

Many data analysts and scientists, application developers, and web text mining engineers all over the world use Apache Nutch.

Apache Nutch is a cross-platform solution written in Java.

Key features and benefits:

  • Fetching and parsing are done separately by default
  • Supports a wide variety of document formats: Plain Text, HTML/XHTML+XML, XML, PDF, ZIP and many others
  • Uses XPath and namespaces to do the mapping
  • Distributed filesystem (via Hadoop)
  • Link-graph database
  • NTLM authentication

5. Norconex

A great tool for those who are searching for an open source web crawler for enterprise needs.

Norconex allows you to crawl any web content. You can run this full-featured collector on its own, or embed it in your own application.

Works on any operating system. It can crawl millions of pages on a single server of average capacity. In addition, it has many content and metadata manipulation options, and it can extract a page's “featured” image.

Key features and benefits:

  • Multi-threaded
  • Supports different hit interval according to different schedules
  • Extract text out of many file formats (HTML, PDF, Word, etc.)
  • Extract metadata associated with documents
  • Supports pages rendered with JavaScript
  • Language detection
  • Translation support
  • Configurable crawling speed
  • Detects modified and deleted documents
  • Supports external commands to parse or manipulate documents
  • Many others

6. BUbiNG

BUbiNG will surprise you. It is a next-generation open source web crawler: a fully distributed crawler written in Java (no central coordination) that can crawl several thousand pages per second and collect really big datasets.

BUbiNG's distribution is based on modern high-speed protocols in order to achieve very high throughput.

BUbiNG provides massive crawling for the masses. It is completely configurable, extensible with little effort, and integrated with spam detection.

Key features and benefits:

  • High parallelism
  • Fully distributed
  • Uses JAI4J, a thin layer over JGroups that handles job assignment.
  • Detects (presently) near-duplicates using a fingerprint of a stripped page
  • Fast
  • Massive crawling.

7. GNU Wget

GNU Wget is a free and open source software tool written in C for retrieving files using HTTP, HTTPS, FTP, and FTPS.

The most distinguishing feature is that GNU Wget has NLS-based message files for many different languages. In addition, it can optionally convert absolute links in downloaded documents to relative links.

Runs on most UNIX-like operating systems as well as Microsoft Windows. GNU Wget is a powerful website scraping tool with a variety of features.

Key features and benefits:

  • Can resume aborted downloads, using REST and RANGE
  • Can use filename wild cards and recursively mirror directories
  • Supports HTTP proxies
  • Supports HTTP cookies
  • Supports persistent HTTP connections
  • Unattended / background operation
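
As a rough illustration of driving Wget from a script (assuming the wget binary is installed and on the PATH; the URL and contact address below are placeholders), the snippet mirrors a site while resuming interrupted downloads, converting links for offline use and pausing between requests.

    import subprocess

    subprocess.run([
        "wget",
        "--mirror",          # recursive download with timestamping
        "--continue",        # resume partially downloaded files
        "--convert-links",   # rewrite links in downloaded pages for offline viewing
        "--no-parent",       # do not ascend above the starting directory
        "--wait=1",          # pause one second between requests
        "--user-agent=example-mirror-bot (admin@example.com)",  # placeholder identity
        "https://example.com/docs/",                            # placeholder URL
    ], check=True)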

8. Arachnode.net

Arachnode.net is for those who are looking for an open source web crawler in C#. Arachnode.net is a class library which downloads content from the internet, indexes this content and provides methods to customize the process.

You can use the tool for personal content aggregation, or for extracting, collecting, and parsing downloaded content into multiple forms. Discovered content is indexed and stored in Lucene.NET indexes.

Arachnode.net is a good software solution for text mining purposes as well as for learning advanced crawling techniques.

Key features and benefits:

  • .NET architecture – the most comprehensive open source C# web crawler
  • Configurable rules and actions
  • Lucene.NET Integration
  • SQL Server and full-text indexing
  • .DOC/.PDF/.PPT/.XLS Indexing
  • HTML to XML and XHTML
  • Full JavaScript/AJAX Functionality
  • Multi-threading and throttling
  • Respectful crawling
  • Analysis services

9. OpenSearchServer

OpenSearchServer is an open source enterprise class search engine and web crawling software. It is a fully integrated and very powerful solution. One of the best solutions out there.

OpenSearchServer has some of the highest-rated reviews on the internet. It is packed with a full set of search functions and allows you to build your own indexing strategy.

The web crawler includes inclusion and exclusion filters with wildcards, HTTP authentication, screenshots, sitemaps, etc. It is written in Java and is a cross-platform solution.

Key features and benefits:

  • A fully integrated solution
  • The crawlers can index everything
  • Full-text, boolean and phonetic search
  • 17 language options
  • Automatic classifications
  • Scheduling for periodic tasks
  • Parsing: Office documents (such as Word, Excel, PowerPoint), OpenOffice documents, PDF files, web pages (HTML), RTF, plain text, audio files, image metadata, and more.

10. Nokogiri

If you use Ruby, Nokogiri could be your solution. Nokogiri can transform a webpage into a Ruby object. In addition, it makes the whole web crawling process really easy and simple.

Nokogiri is an HTML, XML, SAX, and Reader parser. It has many features, and its ability to search documents via XPath or CSS3 selectors is one of the best.

Nokogiri is a large library and provides example usages for parsing and examining a document. This data extraction software runs on Windows, Linux (including Ubuntu), and Mac OS.

Key features and benefits:

  • XML/HTML DOM parser which handles broken HTML
  • XML/HTML SAX parser
  • XML/HTML Push parser
  • XPath 1.0 support for document searching
  • CSS3 selector support for document searching
  • XML/HTML builder
  • XSLT transformer

How to choose the best open source website crawler?

Crawling or scraping data software tools are becoming more and more popular. Hundreds of options have become available with different functionality and scalability.

Choosing the right option can be a tricky business. Here are some tips to help you find out the right open source web scraping software for your needs.

  • Scalability

The web data extraction solution that you choose should be scalable. If your data needs are growing, the crawling tool shouldn’t slow you down. Your future data requirements should be covered.

This means the website crawler architecture should permit adding extra machines and bandwidth to handle future scaling up.

  • Distributed web crawling

This means all downloaded pages have to be distributed among many computers (even hundreds of computers) in a fraction of a second.

In other words, the web data extraction software should have the capability to perform in a distributed way across multiple machines.

  • Robustness

Robustness refers to the web scraper’s ability to avoid getting trapped in an endless number of pages.

Website scrapers must be stable and must not fall into the traps set by some web servers, which trick crawlers into fetching an enormous number of pages within a domain until they stop working.

  • Politeness

Politeness is a must for all of the open source web crawlers. Politeness means spiders and crawlers must not harm the website. To be polite a web crawler should follow the rules identified in the website’s robots.txt file.

Also, your web crawler should honor the Crawl-Delay directive and send a User-Agent header. Crawl-Delay keeps the bot from hitting a website too frequently; when a server receives more requests than it can handle, it becomes overloaded and unresponsive.

The User-Agent header allows you to include your contact details (such as an email address and website), so the website owner can contact you in case your crawler is ignoring the core rules. A minimal sketch of these politeness checks follows.
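
This sketch uses only Python's standard library; the crawler name and URLs are placeholders. It checks robots.txt and honors any Crawl-Delay before fetching a page.

    import time
    import urllib.robotparser
    from urllib.request import Request, urlopen

    USER_AGENT = "example-crawler/1.0 (+https://example.com/bot-info)"  # placeholder identity

    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("https://example.com/robots.txt")
    robots.read()

    url = "https://example.com/some/page"
    if robots.can_fetch(USER_AGENT, url):
        delay = robots.crawl_delay(USER_AGENT) or 1  # fall back to a one-second pause
        time.sleep(delay)
        html = urlopen(Request(url, headers={"User-Agent": USER_AGENT})).read()
        print("Fetched", len(html), "bytes")
    else:
        print("Disallowed by robots.txt:", url)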

  • Extensible

Open source web crawlers should be extensible in many respects: they have to handle new fetch protocols, new data formats, and so on. In other words, the crawler architecture should be modular.

  • Data delivery formats

Ask yourself what data delivery formats you need. Do you need JSON format? Then choose a web data extraction software that delivers the data in JSON. Of course, the best choice is to find one that delivers data in multiple formats.

  • Data quality

As you might know, the scraped data is initially unstructured data (see unstructured data examples). You need to choose software capable of cleaning the unstructured data and presenting it in a readable and manageable manner.

It doesn’t need to be a full data cleansing tool, but it should take care of cleaning up and classifying the initial data into useful data for you.

Conclusion

Scraping or extracting information from a website is an approach applied by a number of businesses that need to collect a large volume of data related to a particular subject.

All of the open source web crawlers have their own advantages as well as drawbacks.

You need to carefully evaluate the web scrapers and then choose one according to your needs and requirements.

For example, Scrapy is faster and very easy to use, but it is not as scalable as Heritrix, BUbiNG, and Nutch. Scrapy is also an excellent choice for those aiming at focused crawls.

Heritrix is scalable and performs well in a distributed environment. However, it is not dynamically scalable. On the other hand, Nutch is very scalable and also dynamically scalable through Hadoop. Nokogiri can be a good solution for those who want an open source web crawler in Ruby.

If you need more open source solutions related to data, then our posts about the best open source data visualization software and the best open source data modeling tools might be useful for you.

Which are your favorite open source web crawlers? What data do you wish to extract?

Download the infographic in PDF: Tips to choose the best open source web crawlers.


Silvia Vylcheva has more than 10 years of experience in the digital marketing world, which has given her wide business acumen and the ability to identify and understand different customer needs.

Silvia has a passion for and knowledge of different business and marketing areas such as inbound methodology, data intelligence, competition research and more.

Source

5 Best Private Search Engines

Using a private search engine such as StartPage or DuckDuckGo is becoming ever more important. These usually leverage the big search engines in order to return results, but proxy search requests so that Google, Yahoo or Microsoft do not know who did the search. In other words, they see only that the search query came from the privacy search engine.

These privacy search engines promise not to log your IP address or any searches you make. Does this sound good to you? Good. The next question, then, is which privacy search engine to use…

Here are the best private search engines that are anonymous and make a great Google alternative.

  1. DuckDuckGo – The most popular private search engine
  2. SearX – An open-source search engine that guarantees no logs
  3. Disconnect Search – Privacy-oriented extension that works with your favorite browser
  4. StartPage – A user-friendly search engine if you don’t mind ads
  5. Peekier – A service to watch thanks to its radically different approach

Keep reading this guide to learn more about each private search engine in-depth.

The problem with most search engines is that they spy on you. This is their business model – to learn as much about you as possible, to deliver highly targeted advertising directly to your browser window. 

Google has even recently dropped its moratorium on combining what it learns by scanning your emails with what it learns about you through your searches. All the better to spy on you. Information typically collected and stored each time you make a search includes:

  • Your IP address
  • Date and time of query
  • Query search terms
  • Cookie ID – this cookie is deposited in your browser’s cookie folder, and uniquely identifies your computer. With it, a search engine provider can trace a search request back to your computer.

This information is usually transmitted to the requested web page, and to the owners of any third party advertising banners displayed on that page. As you surf the internet, advertisers build up a (potentially highly embarrassing) profile of you.

Of course, if Google, Microsoft, and Yahoo!, etc., know lots about you, this information can be (and often is) handed over to the police and the NSA. So it’s a good time to get a Google alternative.

Indeed, it was only recently that evidence emerged showing that Yahoo worked hand in glove with the NSA to betray its users to the intelligence services. Naughty, naughty.

Google Transparency Report on the number of User Data Requests received, and the number (at least partially) acceded to

An added benefit of using a search engine that does not track you is that it avoids the “filter bubble” effect. Most search engines use your past search terms (and things you “Like” on social networks) to profile you. They can then return results they think will interest you.

This can result in only receiving search returns that agree with your point of view, and this locks you into a “filter bubble,” where you do not get to see alternative viewpoints and opinions because they have been downgraded in your search results.

Not only does this deny you access to the rich texture and multiplicity of human input, but it can also be hazardous as it can confirm prejudices, and prevent you from seeing the “bigger picture”.

1. DuckDuckGo


In a world governed by tracking, DuckDuckGo promises to uphold your privacy!

DuckDuckGo is “The Search Engine that Vows Not to Track You”. Gabriel Weinberg, the CEO and founder of DuckDuckGo, has stated that “if the FBI comes to us, we have nothing to tie back to you.”

It is a US-based company and is the most popular and high-profile of the privacy search engines. Searches are primarily sourced via Yahoo, with whom DuckDuckGo has a strong relationship.

This is very worrying given recent revelations about Yahoo’s ties to the NSA, but DuckDuckGo continues to promise that it does not collect or share personal information.

Search results

  • DuckDuckGo offers search suggestions as you type in a query.
  • Search returns are speedy.
  • This includes image and video search returns.
  • Presentation of results is very clear.
  • Search filter categories include Web, Images, Videos, Products, Meanings, Definition, and News. Displayed filters are adaptive, and DDG will initially show results under the filter category that it feels is most appropriate to the search terms. Depending on the filter selected, DuckDuckGo may display image, video or Wikipedia previews at either the top of the search page or in a box to the right of the results.
  • Ads may also be displayed to the right of search results. Paid ads are clearly marked as such, are discreet, and are never mixed in with the “pure” search returns.
  • Image results, however, can only be filtered by size (Small, Medium, Large).
  • Video results display a thumbnail preview. YouTube videos can be played directly from the DDG website, but a warning alerts you to the fact that these will be tracked by YouTube/Google.
  • Results can also be filtered by country and date (Anytime, Past Day, Past Week or Past Month).
  • Subjectively, I find the quality of DuckDuckGo’s search returns to be very good. I have seen complaints, however, by others who do not find them as good as those from Google. This is one reason why “bangs” are so useful (see below).

Here we can see both the contextual filter in action (auto-directing to Products) and DDG’s discreet ads

How it makes money

DuckDuckGo displays ads alongside its search results. These are sourced from Yahoo as part of the Yahoo-Microsoft search alliance. By default, when advertisers sign up for a Bing Ads account, their ads automatically enter rotation into all of Bing’s distribution channels, including DuckDuckGo.

Importantly, however, these ads are untargeted (they are displayed based on your search terms). And as already noted, they are clearly marked and are shown separately from the “pure” search returns.

DuckDuckGo is part of the affiliate programs of Amazon and eBay. When you visit those sites through DuckDuckGo and subsequently make a purchase, it receives a small commission. No personally identifiable information is given out in this way, however, and this does not influence search result rankings.

Privacy

DuckDuckGo states that it does not collect or share personal information.

  • An affiliate code may be added to some eCommerce sites (e.g., Amazon & eBay), but this does not include any personally identifiable information.
  • Being based in the US means that DuckDuckGo is subject to government pressure and laws such as FISA and the Patriot Act. This means that the US government could mandate that DuckDuckGo start logging its users’ activities, and could prevent the company from alerting users to this fact via a gag order.
  • DuckDuckGo uses Amazon servers. Again, this is a US company, subject to pressure from the US government.
  • Qualys SSL labs security report: A+

Gabriel Weinberg, CEO of DuckDuckGo, has contacted me regarding this article, attempting to once again reassure us that DuckDuckGo is privacy-conscious and retains no data.

Features

In addition to its rather nifty contextual filters, the most striking feature of DuckDuckGo is “bangs”. 

These allow you to search other websites quickly and easily. For example, typing !guk before a search query will return Google UK search results, and typing !a will search the Amazon store for you.

Note that bangs take you to the website in question. The searches are not proxied, so you lose an element of privacy if you bang Google directly. Fortunately, there is a solution. You can combine bangs with Startpage.com (see review below) by typing !s or !sp, and because Startpage.com uses Google, you can have the best of both worlds.

My thoughts

DuckDuckGo offers a good-looking and easy-to-use interface, although some may prefer Google to the primarily Yahoo-based search results.

Bangs are a killer feature, however, and one that goes a long way towards compensating for this issue. Just remember that if you want to query Google and protect your privacy, it makes sense to bang into StartPage.com with the !s or !sp for Google search results in privacy instead of going to Google directly.

It is little surprise, then, that DuckDuckGo is so popular. But the fact that it is a US company should sound a note of caution.

2. SearX


SearX is versatile with public and self-hosted options – the latter of which is unrivalled in privacy

Less well-known, but fast gaining traction with the security community, is SearX. Not only is SearX fully open-source, but it is also easy to set up and run your own instance of it.

There is an official public SearX instance, or you can use one of many volunteer-run public instances. But what SearX is really about is running your own instance. This makes SearX the only metasearch engine where you can be 100 percent sure that no logs are kept!
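
For those who do run their own instance, here is a hedged sketch of querying it programmatically. SearX exposes a JSON output format, though it usually has to be enabled in the instance's settings.yml, and the host and port below assume a default self-hosted setup, so adjust as needed. It uses the third-party requests package.

    import requests  # pip install requests

    # Assumption: a self-hosted SearX instance on localhost:8888 with the
    # "json" output format enabled in its settings.
    resp = requests.get(
        "http://localhost:8888/search",
        params={"q": "open source web crawlers", "format": "json"},
        timeout=10,
    )
    for result in resp.json().get("results", [])[:5]:
        print(result["title"], "->", result["url"])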

Search results

  • By default, SearX leverages results from a large number of search engines.


In Preferences, users can change which search engines are used

  • Search suggestions are not offered
  • Searches can be filtered by the following categories: General, Files, Images, IT, Map (using OpenStreetMap), Music, News, Science, Social Media, and Videos. They can also be filtered by time.
  • There are no ads.
  • Wikipedia entries are displayed to the right of search results.
  • There are no additional filters for Images, although a preview is displayed when they are clicked on.
  • Video results display a thumbnail preview. Clicking on a video takes you to the website it is hosted on (for example YouTube or Vimeo).
  • Search results can be downloaded as a .csv, .json., or rss file.
  • As with Startpage, search results can be viewed proxied. This will “break” many websites, but does allow for a very high level of privacy.
  • Search results are as good as the engines selected. The official instance uses Google, Bing, Wikipedia, and a host of other first-rate engines by default, so the results are excellent.


There are no ads, search suggestions are listed to the right, and as with Startpage, you can proxy webpages

How it makes money

SearX is an open-source project run by volunteers. On the official instance, there is no on-site advertising and no affiliate marketing.

Because it is open-source, individual operators of public SearX instances are free to introduce their own finance models. But I have yet to find a single instance that is not 100 percent ad and affiliate-free.

Privacy

  • There is no way to know if a public SearX instance operator is logging your searches. And this includes the official instance.
  • That being said, there is no way to guarantee that DDG, Startpage, or any other “private” search engines are not logging your searches either…
  • If you are serious about privacy, therefore, you should set up your own SearX instance. In fact, setting up your own SearX instance on a server that only you directly control is the only way currently available to guarantee that your searches are not logged.
  • This makes self-hosted SearX instances by far the most secure search engines available. Documentation for installing your own SearX instance is available here.
  • For the casual user, public SearX instances are unlikely to log your searches and are much less likely to be monitored by the likes of the NSA than the other services mentioned here.
  • Just remember, though, that there is no way to be sure of this.
  • Qualys SSL labs security report for searx.me (the official instance): A. Note that each SearX instance (public or private) is different in this respect.

Features

As with Startpage, the ability to proxy websites is a killer feature if you can live with it breaking many websites that you visit.

My thoughts

For serious tech-savvy privacy-heads, a self-hosted SearX instance is the way to go. Simply put, nothing else is in the same league when it comes to knowing for certain that your searches are not logged.

More casual users may also be surprised at how well the software works on public instances. My personal feelings are that these are much less likely to log your searches or be spied on by the US and other governments than DuckDuckGo, Startpage or Disconnect. But this is purely speculation.

3. Disconnect Search


Disconnect attempts to cater to all your security needs – from VPNs to browser extensions

Disconnect, a US-based company, has made a name for itself with some excellent open-source privacy-oriented browser extensions in recent years. One of these is the open-source Disconnect Search add-on for Firefox and Chrome (a non-open-source Android app is also available).

This browser add-on is still the primary way to use Disconnect Search, although a JavaScript web app is available. This mimics the browser extension and allows you to perform web searches from the Disconnect Search web page.

Disconnect also markets a Premium VPN and online security app, with Disconnect Search functionality built-in. Please see my Disconnect VPN review for more details on this.

Search results

  • Searches are usually made from the browser add-on.
  • You can select which of three search engines to query: Bing, Yahoo or DuckDuckGo (default).
  • Unlike the other privacy metasearch engines discussed in this article, Disconnect does not display search returns on its own website. Results are simply routed through Disconnect’s servers to hide their origin and are then opened in the selected search engine’s webpage.
  • Incognito mode searches are supported.

The browser extension

How it makes money

Disconnect markets a Premium product, but the Disconnect Search browser extension is free. It hides your IP when doing searches but then sends you directly to the selected search engine.

This means that Disconnect performs no advertising or affiliate marketing of its own when doing a search.

Privacy

  • Disconnect is a US company and is therefore not a good choice for the more NSA-phobic out there.
  • The browser extension is open-source, but search requests can still be logged by Disconnect, as they are made through its servers.
  • Disconnect hosts its service on Amazon servers.
  • Qualys SSL labs security report: A (this is for the Disconnect.me website).

My thoughts

The Disconnect Search browser extension provides a quick and easy way to hide your true identity while doing searches using your favorite search engine. The fact that Disconnect is US-based, however, is a major issue. 

4. StartPage


Based in the Netherlands, StartPage enjoys strong privacy laws unlike its US competitors!

Startpage.com and Ixquick are run by the same company. In the past, Startpage.com returned Google results, while Ixquick returned results from a number of other search engines, but not Google. The two services have now been combined, and both return identical Google results.

Although no longer actively supported, the old Ixquick metasearch engine is still available at Ixquick.eu. Interestingly, despite the legacy engine no longer being actively supported, Startpage.com has recently removed Yahoo results from it. This is in response to news that Yahoo has been helping the NSA spy on its users.

Search results

  • Suggestions are not offered as you type by default, but this can be enabled in settings.
  • Search returns are fast, but perhaps not as fast as those of DuckDuckGo (this is a purely subjective assessment).
  • Presentation of results is very clear.
  • Searches can only be filtered by Web, Images and Video categories. An advanced search option is available that allows you to specify a variety of search parameters, and you can filter results by time.
  • Ads are displayed above the search results. They are clearly marked as ads and are not mixed with the “pure” search results.
  • Video results display an image preview. YouTube videos cannot be played directly on the Startpage website for privacy reasons and will open in a new tab. 
  • Search results are pulled directly from Google and are therefore very good.

Ads are discreet but clearly labeled

How it makes money

Much like DuckDuckGo, Startpage.com makes money from ads and affiliate links. 

These ads are untargeted, clearly marked, and not mixed in with the “real” search returns. They are somewhat more prominently displayed than with DuckDuckGo, however.

Privacy

  • Startpage is based in the Netherlands, which has strong privacy laws.
  • It runs servers collocated in the US. These are owned and controlled by Startpage, and I am assured that they are secure against government snooping. If this worries you, however…
  • It is possible to use non-US servers only (or non-EU servers).
  • Web pages returned from searches can be proxied (see below).
  • Startpage is the only privacy search engine that has been independently audited.
  • Qualys SSL labs security report: A+

Features

Startpage.com’s killer feature is that, rather than visiting a website directly, you can proxy the connection. If you select this option, then a proxy server run by Startpage.com sits between your computer and the website.

This prevents the website from knowing your true IP address (much like a VPN), and from being able to use web tracking and fingerprinting technologies to identify and track you. It also blocks malicious scripts. 

The downside is that pages load more slowly since StartPage.com must retrieve the contents and re-display them. That said, the newly re-branded and redesigned “Anonymous View” is much faster than was previously the case. It also breaks websites much less because it allows JavaScript “while rewriting and ‘redefining’ JavaScript primitives to protect your privacy.” 

I must say that this is a terrific feature and one that can significantly improve your privacy. Given its downside, however, you probably won’t want to use it all the time.

My thoughts

With its new re-design, StartPage.com matches DuckDuckGo in terms of prettiness and user-friendliness.

But thanks to being based in the Netherlands and having nothing to do with Yahoo, it should be more resistant to NSA spying than its US-based rival (if you specify non-US servers!). And the ability to proxy web pages is an absolute doozy.

5. Peekier


Peekier brings welcome changes to the standardized search engine.

Peekier is a new no-logs search engine. There is not enough information about this service currently available for me to give it a proper assessment. It is worth mentioning, however, because of the attractive and innovative way that it displays search results.


In a field where, if we are honest, most search engines look pretty similar, it is great to see a different approach. I therefore think it worth flagging up Peekier and keeping an eye on the service to see how it develops.

Using any of these search engines will significantly improve your search privacy. Crucially, your searches will not be recorded to help build a profile that is used to sell you stuff. All the search engines I looked at in this article are easy to use and return good results.

Will these services protect your searches from government surveillance (and the NSA in particular)? In the case of US companies, it is safest to assume not. But unless you are doing something very illegal, this may not concern you (although it should).

Startpage is non-US based, has been independently audited, and allows you to access websites with a great deal of privacy thanks to its proxy feature. It is, therefore, a much better choice for privacy-heads than DuckDuckGo.

Public SearX instances are less likely to be monitored than other higher-profile search engines, but they may be. It is also likely that you will know nothing about their operators. Running your own SearX instance on hardware directly under your control, however, is an extremely secure and private solution, and is therefore the only one that I can recommend to serious privacy fanatics.

The fact that SearX has a great interface and returns on-the-button results from all the major search engines is the icing on the cake.

Source

Open source search engines – newsandimages.net



A search engine is a type of program used to search for information. Through search engines, we can search many different types of data, whether local to the computer or out on the Internet; Internet-based search engines are by far the most in demand. Depending on the type of data to be searched, different types of search engines have been developed. People today are busier than ever and have little time to gather information, so developers regularly upgrade search engines to meet their needs, and with them we can find almost any information very quickly. There are many famous search engines in the world; the best-known examples are Google, Bing and Yahoo. To make your search for information easier, below is a list of some open source search engines. We hope this list will benefit you.


Open source search engines


Source

50 Best Open Source Web Crawlers – ProWebScraper


As an automated program or script, a web crawler systematically crawls through web pages in order to build an index of the data that it sets out to extract. The process itself is called web crawling or spidering.

You might wonder what a web crawling application or web crawler is and how it might work.

The tools that you use for the process are termed web spiders, web data extraction software, and website scraping tools.

The reason why web crawling applications matter so much today is that they can accelerate the growth of a business in many ways. In a data-driven world, these applications come in quite handy as they collate information and content from diverse public websites and provide it in a manageable format. With the help of these applications, you can keep an eye on crumbs of information scattered all over: the news, social media, images, articles, your competition, etc.

In order to leverage these applications, you need to survey and understand their different aspects and features. In this blog, we will take you through the different open source web crawling libraries and tools which can help you crawl and scrape the web and parse out the data.

We have put together a comprehensive summary of the best open source web crawling libraries and tools available in each language:

Open Source Web Crawler in Python:

1. Scrapy :

  • Language : Python
  • Github star : 28660
  • Support

Description :

  • Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.
  • It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
  • It's built for extracting specific information from websites and lets you focus on the data extraction using CSS selectors and XPath expressions.
  • If you are familiar with Python you’ll be up and running in just a couple of minutes.
  • It runs on Linux, Mac OS, and Windows systems.

Features :

  • Built-in support for extracting data from HTML/XML sources using extended CSS selectors and XPath expressions.
  • Generating feed exports in multiple formats (JSON, CSV, XML).
  • Built on Twisted
  • Robust encoding support and auto-detection.
  • Fast and powerful.
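
The selector support mentioned above can also be used on its own, outside a full spider. A small sketch (the HTML string is made up):

    from scrapy.selector import Selector

    html = "<html><body><h1>Crawlers</h1><a href='/scrapy'>Scrapy</a></body></html>"
    sel = Selector(text=html)

    print(sel.css("h1::text").get())       # 'Crawlers'  (extended CSS selector)
    print(sel.xpath("//a/@href").get())    # '/scrapy'   (XPath expression)
    print(sel.css("a::text").getall())     # ['Scrapy']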

– Documentation : https://docs.scrapy.org/en/latest/

– Official site : https://scrapy.org/

2. Cola :

  • Language : Python
  • Github star : 1274
  • Support

Description :

  • Cola is a high-level distributed crawling framework, used to crawl pages and extract structured data from websites.
  • It provides a simple, fast, yet flexible way to achieve your data acquisition objectives.
  • Users only need to write one piece of code which can run under both local and distributed mode.

Features :

  • High-level distributed crawling framework
  • Simple and fast
  • Flexible

– Documentation : https://github.com/chineking/cola

– Official site : https://pypi.org/project/Cola/

3. Crawley :

  • Language : Python
  • Github star : 144
  • Support

Description :

  • Crawley is a Pythonic scraping/crawling framework intended to simplify the way you extract data from web pages into structured storage such as databases.

Features :

  • High Speed WebCrawler built on Eventlet.
  • Supports relational database engines like PostgreSQL, MySQL, Oracle, and SQLite.
  • Supports NoSQL databases like MongoDB and CouchDB.
  • Export your data into JSON, XML or CSV formats.
  • Command line tools.
  • Extract data using your favourite tool: XPath or PyQuery (a jQuery-like library for Python).
  • Cookie Handlers.

– Documentation : https://pythonhosted.org/crawley/

– Official site : http://project.crawley-cloud.com/

4. MechanicalSoup :

  • Language : Python
  • Github star : 2803
  • Support

Description :

  • MechanicalSoup is a Python library designed to simulate the behavior of a human using a web browser; it is built around the parsing library BeautifulSoup.
  • If you need to scrape data from simple sites or if heavy scraping is not required, using MechanicalSoup is a simple and efficient method.
  • MechanicalSoup automatically stores and sends cookies, follows redirects and can follow links and submit forms.

Features :

  • Lightweight
  • Cookie Handlers.
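
A hedged sketch of a typical MechanicalSoup session, based on the library's usual StatefulBrowser workflow (the target site and form field are just an example): open a page, fill in a form, submit it, and inspect the result as a BeautifulSoup document.

    import mechanicalsoup  # pip install MechanicalSoup

    browser = mechanicalsoup.StatefulBrowser(user_agent="example-bot/1.0")  # placeholder identity
    browser.open("https://duckduckgo.com/html/")
    browser.select_form("form")                 # select the first form on the page
    browser["q"] = "open source web crawlers"   # fill in the search field
    browser.submit_selected()                   # cookies and redirects are handled for us

    page = browser.get_current_page()           # a BeautifulSoup document
    print(page.title.get_text() if page.title else "no title")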

– Documentation : https://mechanicalsoup.readthedocs.io/en/stable/

– Official site : https://mechanicalsoup.readthedocs.io/

5. PySpider :

  • Language : Python
  • Github star : 11803
  • Support

Description :

  • PySpider is a powerful spider (web crawler) system in Python.
  • It supports JavaScript pages and has a distributed architecture.
  • PySpider can store the data in a backend database of your choosing, such as MySQL, MongoDB, Redis, SQLite, or Elasticsearch.
  • You can use RabbitMQ, Beanstalk, and Redis as message queues.

Features :

  • Powerful WebUI with script editor, task monitor, project manager and result viewer
  • Supports heavy AJAX websites.
  • Facilitates more comfortable and faster scraping
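
Below is a sketch of a PySpider handler, closely following the sample script that PySpider's web UI generates. It is meant to be run inside the pyspider framework rather than standalone, and the start URL is just an example.

    from pyspider.libs.base_handler import *

    class Handler(BaseHandler):
        crawl_config = {}

        @every(minutes=24 * 60)
        def on_start(self):
            # Seed URL (just an example); re-crawl it once a day.
            self.crawl("http://scrapy.org/", callback=self.index_page)

        @config(age=10 * 24 * 60 * 60)
        def index_page(self, response):
            # Follow every outgoing http(s) link found on the page.
            for each in response.doc('a[href^="http"]').items():
                self.crawl(each.attr.href, callback=self.detail_page)

        def detail_page(self, response):
            # Return a structured record; PySpider stores it in the result backend.
            return {"url": response.url, "title": response.doc("title").text()}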

– Documentation : http://docs.pyspider.org/

– Official site : https://github.com/binux/pyspider

6. Portia :

  • Language : Python
  • Github star : 6250
  • Support


Description :

  • Portia is a visual scraping tool created by Scrapinghub that does not require any programming knowledge.
  • If you are not a developer, it's best to go straight with Portia for your web scraping needs.
  • You can try Portia for free without needing to install anything; all you need to do is sign up for an account at Scrapinghub and you can use their hosted version.
  • Making a crawler in Portia and extracting web contents is very simple if you do not have programming skills.
  • You won’t need to install anything as Portia runs on the web page.
  • With Portia, you can use the basic point-and-click tools to annotate the data you wish to extract, and based on these annotations Portia will understand how to scrape data from similar pages.
  • Once the pages are detected Portia will create a sample of the structure you have created.

Features :

  • Actions such as click, scroll, wait are all simulated by recording and replaying user actions on a page.
  • Portia is great for crawling Ajax-powered websites (when subscribed to Splash) and should work fine with heavy JavaScript frameworks like Backbone, Angular, and Ember.

– Documentation : https://portia.readthedocs.io/en/latest/index.html

– Official site : https://github.com/scrapinghub/portia

7. Beautifulsoup :


Description :

  • Beautiful Soup is a Python library designed for quick turnaround projects like web scraping.
  • It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Features :

  • Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
  • Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.
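
A minimal sketch of Beautiful Soup in use (the HTML snippet is made up): parse a document with the standard-library parser and walk the resulting tree.

    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    html = """
    <html><body>
      <h1>Open source crawlers</h1>
      <a href="https://scrapy.org">Scrapy</a>
      <a href="http://nutch.apache.org">Apache Nutch</a>
    </body></html>
    """

    soup = BeautifulSoup(html, "html.parser")  # lxml or html5lib could be used instead
    print(soup.h1.get_text())                  # 'Open source crawlers'
    for link in soup.find_all("a"):
        print(link.get_text(), "->", link["href"])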

– Documentation : https://www.crummy.com/software/BeautifulSoup/bs4/doc/

– Official site : https://www.crummy.com/software/BeautifulSoup/

8. Spidy Web Crawler :

  • Language : Python
  • Github star : 152
  • Support


Description :

  • Spidy is a web crawler which is easy to use and is run from the command line. You have to give it the URL of a webpage and it starts crawling away! A very simple and effective way of fetching stuff off the web.
  • It uses Python requests to query the webpages, and lxml to extract all links from the page. Pretty simple!

Features :

  • Error Handling
  • Cross-Platform compatibility
  • Frequent Timestamp Logging
  • Portability
  • User-Friendly Logs
  • Webpage saving
  • File Zipping

– Documentation : https://github.com/rivermont/spidy

– Official site : https://github.com/rivermont/spidy

9. Grab :

  • Language : Python
  • Github star : 1627
  • Support

Description :

  • Grab is a python framework for building web scrapers.
  • With Grab you can build web scrapers of various complexity, from simple 5-line scripts to complex asynchronous website crawlers processing millions of web pages.
  • Grab provides an API for performing network requests and for handling the received content e.g. interacting with DOM tree of the HTML document.

Features :

  • HTTP and SOCKS proxy with/without authorization
  • Automatic charset detection
  • Powerful API to extract data from DOM tree of HTML documents with XPATH queries
  • Automatic cookies (session) support

– Documentation : https://grablib.org/en/latest/

– Official site : https://github.com/lorien/grab

Open Source Web Crawler Java :

10. Apache Nutch :

  • Language : Java
  • Github star : 1743
  • Support


Description :

  • Apache Nutch is a highly extensible and scalable open source web crawler software project.
  • When it comes to the best open source web crawlers, Apache Nutch definitely has a top place on the list.
  • Apache Nutch is popular as a highly extensible and scalable open source web data extraction project that is great for data mining.
  • Nutch can run on a single machine, but much of its strength comes from running in a Hadoop cluster.
  • Many data analysts and scientists, application developers, and web text mining engineers all over the world use Apache Nutch.
  • Apache Nutch is a cross-platform solution written in Java.

Features :

  • Fetching and parsing are done separately by default
  • Supports a wide variety of document formats: Plain Text, HTML/XHTML+XML, XML, PDF, ZIP and many others
  • Uses XPath and namespaces to do the mapping
  • Distributed file system (via Hadoop)
  • Link-graph database
  • NTLM authentication

– Documentation : https://wiki.apache.org/nutch/

– Official site : http://nutch.apache.org/

11. Heritrix :

  • Language : Java
  • Github star : 1236
  • Support

Description :

  • Heritrix is one of the most popular free and open-source web crawlers in Java. Actually, it is an extensible, web-scale, archival-quality web scraping project.
  • Heritrix is a very scalable and fast solution. You can crawl/archive a set of websites in no time. In addition, it is designed to respect the robots.txt exclusion directives and META robots tags.
  • Runs on Linux/Unix-like systems and Windows.

Features :

  • HTTP authentication
  • NTLM Authentication
  • XSL Transformation for link extraction
  • Search engine independence
  • Mature and stable platform
  • Highly configurable
  • Runs from any machine

– Documentation : https://github.com/internetarchive/heritrix3/wiki/Heritrix%203.0%20and%203.1%20User%20Guide

– Official site : https://github.com/internetarchive/heritrix3

12. ACHE Crawler :

  • Language : Java
  • Github star : 154
  • Support

Description :

  • ACHE is a focused web crawler.
  • It collects web pages that satisfy some specific criteria, e.g., pages that belong to a given domain or that contain a user-specified pattern.
  • ACHE differs from generic crawlers in the sense that it uses page classifiers to distinguish between relevant and irrelevant pages in a given domain.
  • A page classifier can range from a simple regular expression (that matches every page containing a specific word, for example) to a machine-learning-based classification model. ACHE can also automatically learn how to prioritize links in order to efficiently locate relevant content while avoiding the retrieval of irrelevant content.

Features :

  • Regular crawling of a fixed list of websites
  • Discovery and crawling of new relevant websites through automatic link prioritization
  • Configuration of different types of page classifiers (machine learning, regex, etc.)
  • Continuous re-crawling of sitemaps to discover new pages
  • Indexing of crawled pages using Elasticsearch
  • Web interface for searching crawled pages in real-time
  • REST API and web-based user interface for crawler monitoring
  • Crawling of hidden services using TOR proxies

– Documentation : http://ache.readthedocs.io/en/latest/

– Official site : https://github.com/ViDA-NYU/ache

13. Crawler4j :

  • Language : Java
  • Github star : 3039
  • Support

Description :

  • crawler4j is an open source web crawler for Java which provides a simple interface for crawling the Web.
  • Using it, you can set up a multi-threaded web crawler in a few minutes.

– Documentation : https://github.com/yasserg/crawler4j

– Official site : https://github.com/yasserg/crawler4j

14. Gecco :

  • Language : Java
  • Github star : 1245
  • Support

Description :

  • Gecco is an easy-to-use, lightweight web crawler developed in Java. Gecco integrates excellent frameworks such as jsoup, httpclient, fastjson, spring, htmlunit and redisson, so you only need to configure a few jQuery-style selectors to write a crawler very quickly.
  • The Gecco framework has excellent extensibility; it is designed around the open-closed principle (closed for modification, open for extension).

Features :

  • Easy to use; uses jQuery-style selectors to extract elements
  • Support for asynchronous Ajax requests in the page
  • Support for extracting JavaScript variables from the page
  • Uses Redis for distributed crawling (see gecco-redis)
  • Supports developing business logic with Spring (see gecco-spring)
  • Supports the htmlunit extension (see gecco-htmlunit)
  • Supports an extension mechanism
  • Supports random selection of the download User-Agent
  • Supports random selection of the download proxy server

– Documentation : https://github.com/xtuhcy/gecco

– Official site : https://github.com/xtuhcy/gecco

15. BUbiNG :

  • Language : Java
  • Github star : 24
  • Support

Description :

  • BUbiNG will surprise you. It is a next-generation open source web crawler. BUbiNG is a fully distributed Java crawler (no central coordination). It is able to crawl several thousand pages per second and collect really big datasets.
  • BUbiNG's distribution is based on modern high-speed protocols in order to achieve very high throughput.
  • BUbiNG provides massive crawling for the masses. It is completely configurable, extensible with little effort and integrated with spam detection.

Features :

  • High parallelism
  • Fully distributed
  • Uses JAI4J, a thin layer over JGroups that handles job assignment.
  • Detects (presently) near-duplicates using a fingerprint of a stripped page
  • Fast
  • Massive crawling.

– Documentation : http://law.di.unimi.it/software/bubing-docs/index.html

– Official site : http://law.di.unimi.it/software.php#bubing

16. Norconex :

Description :

  • A great tool for those who are searching for open source web crawlers for enterprise needs.
  • Norconex allows you to crawl any web content. You can run this full-featured collector on its own, or embed it in your own application.
  • Works on any operating system. It can crawl millions of pages on a single server of average capacity. In addition, it has many content and metadata manipulation options. Also, it can extract a page’s “featured” image.

Features :

  • Multi-threaded
  • Supports different hit intervals according to different schedules
  • Extract text out of many file formats (HTML, PDF, Word, etc.)
  • Extract metadata associated with documents
  • Supports pages rendered with JavaScript
  • Language detection
  • Translation support
  • Configurable crawling speed
  • Detects modified and deleted documents
  • Supports external commands to parse or manipulate documents

– Documentation : http://www.norconex.com/collectors/collector-http/getting-started

– Official site : http://www.norconex.com/collectors/collector-http/

17. WebSPHINX :

  • Language : Java
  • No support Available

Description :

  • WebSPHINX is a great, easy-to-use, personal and customizable web crawler. It is designed for advanced web users and Java programmers, allowing them to crawl over a small part of the web automatically.
  • This web data extraction solution is also a comprehensive Java class library and interactive development environment. WebSPHINX includes two parts: the Crawler Workbench and the WebSPHINX class library.
  • The Crawler Workbench is a good graphical user interface that allows you to configure and control a customizable web crawler. The class library provides support for writing web crawlers in Java.
  • WebSPHINX runs on Windows, Linux, Mac and other platforms that support Java.

Features :

  • Visualize a collection of web pages as a graph
  • Concatenate pages together for viewing or printing them as a single document
  • Extract all text matching a certain pattern.
  • Tolerant HTML parsing
  • Support for the robot exclusion standard
  • Common HTML transformations
  • Multithreaded Web page retrieval

– Documentation : https://www.cs.cmu.edu/~rcm/websphinx/doc/index.html

– Official site : https://www.cs.cmu.edu/~rcm/websphinx/#about

18. Spiderman :

  • Language : Java
  • Github star : 2400
  • Support

Description :

  • Spiderman is a Java open source web data extraction tool. It collects specific web pages and extracts useful data from those pages.
  • Spiderman mainly uses techniques such as XPath and regular expressions to extract real data.

Features :

  • High performance
  • Collection state persistence
  • Distributed
  • Supports JS scripts

– Documentation : https://gitee.com/l-weiwei/spiderman

– Official site : https://gitee.com/l-weiwei/spiderman

19. WebCollector :

  • Language : Java
  • Github star : 1986
  • Support

Description :

  • WebCollector is an open source web crawler framework based on Java.
  • It provides some simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.

– Documentation : https://github.com/CrawlScript/WebCollector

– Official site : https://github.com/CrawlScript/WebCollector

20. Webmagic :

  • Language : Java
  • Github star : 6891
  • Support

Description :

  • A scalable crawler framework.
  • It covers the whole lifecycle of a crawler: downloading, URL management, content extraction and persistence.
  • It can simplify the development of a specific crawler.

Features :

  • Simple core with high flexibility.
  • Simple API for HTML extraction.
  • Annotation with POJO to customize a crawler, no configuration.
  • Multi-thread and Distribution support.
  • Easy to be integrated.

– Documentation : http://webmagic.io/docs/en/

– Official site : https://github.com/code4craft/webmagic

21. StormCrawler :

  • Language : Java
  • Github star : 437
  • Support

Description :

  • StormCrawler is an open source SDK for building distributed web crawlers based on Apache Storm.
  • StormCrawler is a library and collection of resources that developers can leverage to build their own crawlers.
  • StormCrawler is perfectly suited to use cases where the URLs to fetch and parse come in as streams, but it is also an appropriate solution for large-scale recursive crawls, particularly where low latency is required.

Features :

  • scalable
  • resilient
  • low latency
  • easy to extend
  • polite yet efficient

– Documentation : http://stormcrawler.net/docs/api/

– Official site : http://stormcrawler.net/

Open Source Web Crawler in JavaScript :

22. Node-Crawler :

  • Language : Javascript
  • Github star : 3999
  • Support

Description :

  • Nodecrawler is a popular web crawler for NodeJS, making it a very fast crawling solution.
  • If you prefer coding in JavaScript, or you are dealing with mostly a Javascript project, Nodecrawler will be the most suitable web crawler to use. Its installation is pretty simple too.
  • It uses Cheerio (the default) or JSDOM for server-side HTML parsing and DOM rendering, with JSDOM being the more robust option.

Features :

  • server-side DOM & automatic jQuery insertion with Cheerio (default) or JSDOM
  • Configurable pool size and retries
  • Control rate limit
  • Priority queue of requests
  • forceUTF8 mode to let the crawler handle charset detection and conversion for you
  • Compatible with Node 4.x or newer

– Documentation : https://github.com/bda-research/node-crawler

– Official site : http://nodecrawler.org/

23. Simplecrawler :

  • Language : Javascript
  • Github star : 1764
  • Support

Description :

  • simplecrawler is designed to provide a basic, flexible and robust API for crawling websites.
  • It was written to archive, analyse, and search some very large websites and has happily chewed through hundreds of thousands of pages and written tens of gigabytes to disk without issue.

Features :

  • Provides some simple logic for auto-detecting linked resources – which you can replace or augment
  • Automatically respects any robots.txt rules
  • Has a flexible queue system which can be frozen to disk and defrosted

– Documentation : https://github.com/simplecrawler/simplecrawler

– Official site : https://www.npmjs.com/package/simplecrawler

24. Js-crawler :

  • Language : Javascript
  • Github star : 167
  • Support

Description :

  • A web crawler for Node.js; both HTTP and HTTPS are supported.

– Documentation : https://github.com/antivanov/js-crawler

– Official site : https://github.com/antivanov/js-crawler

25. Webster :

  • Language : Javascript
  • Github star : 201
  • Support

Description :

  • Webster is a reliable web crawling and scraping framework written with Node.js, used to crawl websites and extract structured data from their pages.
  • What makes it different from other crawling frameworks is that Webster can scrape content rendered by client-side JavaScript and AJAX requests in the browser.

– Documentation : http://webster.zhuyingda.com/

– Official site : https://github.com/zhuyingda/webster

26. Node-osmosis :

  • Language : Javascript
  • Github star : 3630
  • Support

Description :

  • HTML/XML parser and web scraper for NodeJS.

Features :

  • Uses native libxml C bindings
  • Clean promise-like interface
  • Supports CSS 3.0 and XPath 1.0 selector hybrids
  • Sizzle selectors, Slick selectors, and more
  • No large dependencies like jQuery, cheerio, or jsdom
  • Compose deep and complex data structures
  • HTML parser features
    • Fast parsing
    • Very fast searching
    • Small memory footprint
  • HTML DOM features
    • Load and search ajax content
    • DOM interaction and events
    • Execute embedded and remote scripts
    • Execute code in the DOM
  • HTTP request features
    • Logs urls, redirects, and errors
    • Cookie jar and custom cookies/headers/user agent
    • Login/form submission, session cookies, and basic auth
    • Single proxy or multiple proxies and handles proxy failure
    • Retries and redirect limits

– Documentation : https://rchipka.github.io/node-osmosis/global.html

– Official site : https://www.npmjs.com/package/osmosis

27. Supercrawler :

  • Language : Javascript
  • Github star : 4341
  • Support

Description :

  • Supercrawler is a Node.js web crawler. It is designed to be highly configurable and easy to use.
  • When Supercrawler successfully crawls a page (which could be an image, a text document or any other file), it will fire your custom content-type handlers. Define your own custom handlers to parse pages, save data and do anything else you need.

Features :

  • Link Detection : Supercrawler will parse crawled HTML documents, identify links and add them to the queue.
  • Robots Parsing : Supercrawler will request robots.txt and check the rules before crawling. It will also identify any sitemaps.
  • Sitemaps Parsing : Supercrawler will read links from XML sitemap files, and add links to the queue.
  • Concurrency Limiting : Supercrawler limits the number of requests sent out at any one time.
  • Rate limiting : Supercrawler will add a delay between requests to avoid bombarding servers.
  • Exponential Backoff Retry : Supercrawler will retry failed requests after 1 hour, then 2 hours, then 4 hours, etc. To use this feature, you must use the database-backed or Redis-backed crawl queue.
  • Hostname Balancing : Supercrawler will fairly split requests between different hostnames. To use this feature, you must use the Redis-backed crawl queue.

– Documentation : https://github.com/brendonboshell/supercrawler

– Official site : https://github.com/brendonboshell/supercrawler

28. Web scraper chrome extension :

  • Language : Javascript
  • Github star : 775
  • Support

Description :

  • Web Scraper is a Chrome browser extension built for data extraction from web pages.
  • Using this extension you can create a plan (sitemap) that defines how a web site should be traversed and what should be extracted.
  • Using these sitemaps the Web Scraper will navigate the site accordingly and extract all the data.
  • Scraped data can later be exported as CSV.

Features :

  • Scrape multiple pages
  • Sitemaps and scraped data are stored in the browser's local storage or in CouchDB
  • Multiple data selection types
  • Extract data from dynamic pages (JavaScript+AJAX)
  • Browse scraped data
  • Export scraped data as CSV
  • Import, Export sitemaps
  • Depends only on Chrome browser

– Documentation : https://www.webscraper.io/documentation

– Official site : https://www.webscraper.io

29. Headless chrome crawler :

  • Language : Javascript
  • Github star : 3256
  • Support

Description :

  • Crawlers based on simple requests to HTML files are generally fast. However, they sometimes end up capturing empty bodies, especially when the websites are built on modern frontend frameworks such as AngularJS, React and Vue.js.
  • Headless Chrome Crawler addresses this by driving headless Chrome, so it can crawl such dynamic, JavaScript-rendered pages.

Features :

  • Distributed crawling
  • Configure concurrency, delay and retry
  • Support both depth-first search and breadth-first search algorithm
  • Pluggable cache storages such as Redis
  • Support CSV and JSON Lines for exporting results
  • Pause at the max request and resume at any time
  • Insert jQuery automatically for scraping
  • Save screenshots for the crawling evidence
  • Emulate devices and user agents
  • Priority queue for crawling efficiency

– Documentation : https://github.com/yujiosaka/headless-chrome-crawler/blob/master/docs/API.md

– Official site : https://github.com/yujiosaka/headless-chrome-crawler

30. X-ray :

  • Language : Javascript
  • Github star : 4464
  • Support

Features :

  • Flexible schema: Supports strings, arrays, arrays of objects, and nested object structures. The schema is not tied to the structure of the page you’re scraping, allowing you to pull the data in the structure of your choosing.
  • Composable: The API is entirely composable, giving you great flexibility in how you scrape each page.
  • Pagination support: Paginate through websites, scraping each page. X-ray also supports a request delay and a pagination limit. Scraped pages can be streamed to a file, so if there’s an error on one page, you won’t lose what you’ve already scraped.
  • Crawler support: Start on one page and move to the next easily. The flow is predictable, following a breadth-first crawl through each of the pages.
  • Responsible: X-ray has support for concurrency, throttles, delays, timeouts and limits to help you scrape any page responsibly.
  • Pluggable drivers: Swap in different scrapers depending on your needs.

– Documentation : https://github.com/matthewmueller/x-ray

– Official site : https://www.npmjs.com/package/x-ray-scraper

Open Source Web Crawler in C :

31. Httrack :

  • Language : C
  • Github star : 747
  • Support

Description :

  • HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility.
  • It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer.
  • HTTrack arranges the original site’s relative link-structure. Simply open a page of the “mirrored” website in your browser, and you can browse the site from link to link, as if you were viewing it online.
  • HTTrack can also update an existing mirrored site, and resume interrupted downloads.
  • HTTrack is fully configurable, and has an integrated help system.

Features :

  • Multilingual Windows and Linux/Unix interface
  • Mirror one site, or more than one site together
  • Filter by file type, link location, structure depth, file size, site size, accepted or refused sites or filename
  • Proxy support to maximize speed, with optional authentication

– Documentation : http://www.httrack.com/html/index.html

– Official site : http://www.httrack.com/

32. GNU Wget :

  • Language : C
  • Github star : 22
  • Support

Description :

  • GNU Wget is a free software package for retrieving files using HTTP, HTTPS, FTP and FTPS, the most widely-used Internet protocols.
  • It is a non-interactive command line tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc. (see the short example below).
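
As a small illustration of calling it from a script (the URL is only a placeholder; --continue resumes a partially downloaded file):

    # Minimal sketch: invoking wget from a Python script, resuming an
    # interrupted download with --continue.
    import subprocess

    subprocess.run(
        ["wget", "--continue", "--tries=3", "https://example.com/big-file.iso"],
        check=True,
    )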

Features :

  • Can resume aborted downloads, using REST and RANGE
  • NLS-based message files for many different languages
  • Runs on most UNIX-like operating systems as well as Microsoft Windows
  • Supports HTTP proxies
  • Supports HTTP cookies

– Documentation : https://www.gnu.org/software/wget/manual/

– Official site : https://www.gnu.org/software/wget/

Open Source Web Crawler in C++ :

33. Open-source-search-engine :

  • Language : C++
  • Github star : 912
  • Support

Description :

  • An open source web and enterprise search engine and spider/crawler
  • Gigablast is one of a handful of search engines in the United States that maintains its own searchable index of over a billion pages.

Features :

  • Large scale
  • High performance
  • Real time information retrieval technology

– Documentation : http://www.gigablast.com/api.html

– Official site : http://www.gigablast.com/

Open Source Web Crawler in C# :

34. Arachnode.net :

  • Language : C#
  • Github star : 9
  • Support

Description :

  • Arachnode.net is for those who are looking for an open source web crawler in C#. It is a class library which downloads content from the internet, indexes this content and provides methods to customize the process.
  • You can use the tool for personal content aggregation, or for extracting, collecting and parsing downloaded content into multiple forms. Discovered content is indexed and stored in Lucene.NET indexes.
  • Arachnode.net is a good software solution for text mining purposes as well as for learning advanced crawling techniques.

Features :

  • Configurable rules and actions
  • Lucene.NET Integration
  • SQL Server and full-text indexing
  • .DOC/.PDF/.PPT/.XLS Indexing
  • HTML to XML and XHTML
  • Full JavaScript/AJAX Functionality
  • Multi-threading and throttling
  • Respectful crawling
  • Analysis services

– Documentation : https://documentation.arachnode.net/index.html

– Official site : http://arachnode.net/

35. Abot :

  • Language : C#
  • Github star : 1392
  • Support

Description :

  • Abot is an open source C# web crawler built for speed and flexibility.
  • It takes care of the low level plumbing (multithreading, http requests, scheduling, link parsing, etc..).
  • You just register for events to process the page data.
  • You can also plugin your own implementations of core interfaces to take complete control over the crawl process.

Features :

  • It’s fast!!
  • Easily customizable (Pluggable architecture allows you to decide what gets crawled and how)
  • Heavily unit tested (High code coverage)
  • Very lightweight (not over engineered)
  • No out of process dependencies (database, installed services, etc…)

– Documentation : https://github.com/sjdirect/abot

– Official site : https://github.com/sjdirect/abot

36. Hawk :

  • Language : C#
  • Github star : 1875
  • Support

Description :

  • HAWK requires no programming; it provides visual, graphical data acquisition and cleaning tools and is open source under the GPL license.

Features :

  • Intelligent analysis of web content without programming
  • WYSIWYG, visual drag and drop, fast data processing such as conversion and filtering
  • Can import and export from various databases and files
  • Tasks can be saved and reused
  • Its most suitable areas are web crawling and data cleaning, but its power goes far beyond that.

– Documentation : https://github.com/ferventdesert/Hawk

– Official site : https://ferventdesert.github.io/Hawk/

37. SkyScraper :

  • Language : C#
  • Github star : 39
  • Support

Description :

  • An asynchronous web scraper / web crawler using async / await and Reactive Extensions

– Documentation : https://github.com/JonCanning/SkyScraper

– Official site : https://github.com/JonCanning/SkyScraper

Open Source Web Crawler in .NET :

38. DotnetSpider :

  • Language : .NET
  • Github star : 1382
  • Support

Description :

  • DotnetSpider is a .NET Standard web crawling library similar to WebMagic and Scrapy. It is a lightweight, efficient and fast high-level web crawling & scraping framework for .NET.

– Documentation : https://github.com/dotnetcore/DotnetSpider/wiki

– Official site : https://github.com/dotnetcore/DotnetSpider

Open Source Web Crawler in PHP :

39. Goutte :

  • Language : PHP
  • Github star : 6574
  • Support

Description :

  • Goutte is a screen scraping and web crawling library for PHP.
  • Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.

– Documentation : https://goutte.readthedocs.io/en/latest/

– Official site : https://github.com/FriendsOfPHP/Goutte

40. Dom-crawler :

  • Language : PHP
  • Github star : 1340
  • Support

Description :

  • The DomCrawler component eases DOM navigation for HTML and XML documents

– Documentation : https://symfony.com/doc/current/components/dom_crawler.html

– Official site : https://github.com/symfony/dom-crawler

41. Pspider :

  • Language : PHP
  • Github star : 249
  • Support

Description :

  • This is a parallel crawling (crawler) framework recently developed in pure PHP code, based on the hightman/httpclient component.

– Documentation : https://github.com/hightman/pspider

– Official site : https://github.com/hightman/pspider

42. Php-spider :

  • Language : PHP
  • Github star : 1023
  • Support

Description :

  • A configurable and extensible PHP web spider

Features :

  • supports crawl depth limiting, queue size limiting and max downloads limiting
  • supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
  • comes with a useful set of URI filters, such as Domain limiting
  • collects statistics about the crawl for reporting

– Documentation : https://github.com/mvdbos/php-spider

– Official site : https://github.com/mvdbos/php-spider

43. Spatie / Crawler :

  • Language : PHP
  • Github star : 740
  • Support

Description :

  • This package provides a class to crawl links on a website.
  • Under the hood Guzzle promises are used to crawl multiple urls concurrently.
  • Because the crawler can execute JavaScript, it can crawl JavaScript rendered sites. Under the hood Chrome and Puppeteer are used to power this feature.

– Documentation : https://github.com/spatie/crawler

– Official site : https://github.com/spatie/crawler

Open Source Web Crawler in Ruby :

44. Mechanize :

  • Language : Ruby
  • Github star : 3728
  • Support

Description :

  • The Mechanize library is used for automating interaction with websites.
  • Mechanize automatically stores and sends cookies, follows redirects, and can follow links and submit forms. Form fields can be populated and submitted.
  • Mechanize also keeps track of the sites that you have visited as a history.

– Documentation : http://docs.seattlerb.org/mechanize/

– Official site : https://github.com/sparklemotion/mechanize

Open Source Web Crawler in GO :

45. Colly :

  • Language : Go
  • Github star : 5439
  • Support

Description :

  • Lightning Fast and Elegant Scraping Framework for Gophers
  • Colly provides a clean interface to write any kind of crawler/scraper/spider.
  • With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.

Features :

  • Clean API
  • Fast (>1k request/sec on a single core)
  • Manages request delays and maximum concurrency per domain
  • Automatic cookie and session handling
  • Sync/async/parallel scraping
  • Caching
  • Automatic encoding of non-unicode responses
  • Robots.txt support
  • Distributed scraping
  • Configuration via environment variables
  • Extensions

– Documentation : http://go-colly.org/docs/

– Official site : http://go-colly.org/

46. Gopa :

  • Language : Go
  • Github star : 169
  • Support

Features :

  • Lightweight, low footprint; memory requirement should be < 100 MB
  • Easy to deploy, no runtime or dependencies required
  • Easy to use; no programming or scripting ability needed; out-of-the-box features

– Documentation : https://github.com/infinitbyte/gopa

– Official site : https://github.com/infinitbyte/gopa

47. Pholcus :

  • Language : Go
  • Github star : 4341
  • Support

Description :

  • Pholcus is a high-concurrency, heavyweight crawler written in pure Go.
  • It is targeted at Internet data collection, and for users with a basic Go or JS programming foundation it only requires attention to rule customization.
  • The rules are simple and flexible, batch tasks are concurrent, and output methods are rich (mysql/mongodb/kafka/csv/excel, etc.).
  • A large number of demos are shared. In addition, it supports both horizontal and vertical crawl modes, along with a series of advanced functions such as simulated login and pausing and cancelling tasks.

Features :

  • A powerful crawler tool.
  • It supports three operating modes: stand-alone, server, and client.
  • It has three operation interfaces: Web, GUI, and command line.

– Documentation : https://pholcus.gitbooks.io/docs/

– Official site : https://github.com/henrylee2cn/pholcus

Open Source Web Crawler in R :

48. Rvest :

  • Language : R
  • Github star : 969
  • Support

Description :

  • rvest helps you scrape information from web pages. It is designed to work with magrittr to make it easy to express common web scraping tasks, inspired by libraries like Beautiful Soup.

– Documentation : https://cran.r-project.org/web/packages/rvest/rvest.pdf

– Official site : https://github.com/hadley/rvest

Open Source Web Crawler in Scala :

49. Sparkler :

  • Language : Scala
  • Github star : 198
  • Support

Description :

  • A web crawler is a bot program that fetches resources from the web for the sake of building applications like search engines, knowledge bases, etc.
  • Sparkler (contraction of Spark-Crawler) is a new web crawler that makes use of recent advancements in distributed computing and information retrieval domains by conglomerating various Apache projects like Spark, Kafka, Lucene/Solr, Tika, and pf4j.

Features :

  • Provides Higher performance and fault tolerance
  • Supports complex and near real-time analytics
  • Streams out the content in real-time
  • Extensible plugin framework
  • Universal Parser

– Documentation : http://irds.usc.edu/sparkler/dev/development-environment-setup.html#contributing-source

– Official site : http://irds.usc.edu/sparkler/

Open Source Web Crawler in Perl :

50. Web-scraper :

  • Language : Perl
  • Github star : 91
  • Support

Description :

  • Web Scraper is a web scraping toolkit that uses HTML and CSS selectors or XPath expressions.

– Documentation : https://github.com/miyagawa/web-scraper

– Official site : https://github.com/miyagawa/web-scraper

Conclusion

The universe of open source web crawling applications is vast and mind-boggling.

Each one is packed with its unique features and what it can accomplish for you.

Based on your needs and technical know-how, you can capitalize on these tools. You need not commit to any one tool; in fact, you may use different tools for the different tasks you come across.

It ultimately depends on the end user. However, it is paramount that you understand the unique strengths of each tool and harness them for your business or whatever task you have undertaken.

Feel free to write to us regarding any queries you might have regarding any of these tools.

Do share your valuable feedback and comments regarding the blog!

Source

Which is the best search engine for finding images?

Images make the web beat. And human beings process visuals faster than they do text. In the last decade, the number of images uploaded on the internet has exploded.

Finding the perfect image to feature on your website, blog post or marketing email can be crucial to grabbing the audience’s attention, livening up a page, or illustrating a point. (And if you optimize it properly, it can also be beneficial to your SEO). To do that, you of course need a good search engine.

The web has plenty of different options for image search, from general search engines with an image search function to dedicated search engines for browsing and indexing images. But which offer the best experience?

In this post, we’re going to compare the best search engines for conducting three categories of image search on the web.

Category 1: General image search

Ever searched for [word + image] on the web? This is the basic type of image search people do on the internet and it comes in handy for day-to-day searches.

The top search engines for performing general searches are as follows:

Google Images

Google remains the go-to source for information, not only because of its large database but simply because its interface is one of the best.  You can use several filters for your searches and also search for images by voice.

Using its advanced search options, you can filter images by size, color, type of image (photo, clip art, etc) and you can also search for images on a specific site. For example, you could search for images of a PC solely from makeuseof.com or pcmag.com.

Google Images advanced search result for the term PC, from www.makeuseof.com or www.pcmag.com

Unfortunately, the advanced search option isn’t visible on the landing page, so to reach it, searchers will need to select ‘Settings’ and then ‘Advanced search’. This will navigate you to a separate page where you can input your desired parameters before being taken to image search results.

Images also appear as thumbnails and don’t enlarge on hover, so you have to click through to get a full view of the images. If you’re wary of Google’s all-seeing eye, then you may be interested in some alternative search engines, which will be discussed below.

Bing Images

Bing is Google’s top contender when it comes to search, and image search is no different. Whereas Google’s interface can appear bland to some people, Bing’s interface is rich and colorful. As Jessie Moore wrote in her recent article, image search may be one of those things that Bing does better than Google.

Similar to Google, searchers can filter photos by color, type, layout, image size, and – crucially to people looking for Creative Commons licensed images – license. Unlike Google, Bing’s filter options are available on the search results pages so you don’t have to navigate away from the page. The only real drawback to Bing’s image search is that you can’t search for images by voice.

Yahoo image search

Though Yahoo might seem a bit passé to many of our readers, for image search, Yahoo is genuinely one of the best options. Its ownership of image-sharing site Flickr comes in really handy here, as photos from Flickr are integrated in image search results, making it a go-to source for custom, user-generated images. Flickr users also have the option to simply save images from their searches to their Flickr account.

The Yahoo search interface is also sleek and straight to the point. Like the Bing interface, all image filters are available on the search results page, so users can set their preferences easily to fine-tune the results.

Category 2: Reverse image search

Ever found a picture of a strange animal or building and wanted to learn more about it? That’s where reverse image search comes in. Although this search method is relatively new, it has become increasingly popular. And it comes in really handy for webmasters and content creators.

Here are some of the benefits of reverse image search:

  1. Verifying the source of an image. With reverse image search, you can trace the original source of an image and see how the image has changed over time. It is particularly effective for authenticating people’s profiles, news stories, and images of events.
  2. Tracking copyrighted images. Photographers and content creators (e.g. of infographics) can use reverse image search to learn how their content is used on the internet. If you create your own images, this can help you keep track of who is using your images without attribution.
  3. Finding similar images. Reverse searching images can help you find better shots or options for an image.

Now that you know the benefits of reverse image search, here are three of the best search engines for getting the job done:

TinEye Reverse Image Search

TinEye is the pioneer when it comes to reverse image search engines. The service was launched in 2008, three years before Google included an option for reverse search.

Users can either upload an image to the site or provide the image’s URL, and the site finds similar images from its repository of over 24 billion images. File sizes are limited to 20MB, and the image has to be in JPG, PNG or GIF format. Users can sort their results by best match, most changed, biggest image, and so on.

TinEye comes in a free and premium version. With the free version, users can perform a maximum of 150 searches per month. For more advanced features, you have to pay for the premium version at $200/year.

Google reverse image search

Unsurprisingly, Google is another leader in reverse image search, which was launched as a feature in June 2011. Unlike TinEye, there is no limit to the size of images that can be uploaded to Google.

Chrome users can simply right click on an image anywhere within Chrome and select “search the web for this image”. The search returns a “best guess for this image” description, as well as pages that include matching images.

Pinterest visual search tool

This tool is best for Pinterest users because you need a Pinterest account to use it. With this tool, users can crop a specific area of an image to search for instead of searching for the entire image. The feature was announced in November 2015 and is perfect for heavy Pinterest users.

Once a user clicks on the image search button, results of similar images are shown almost immediately.

Category 3: Free-to-use images

As you must have noticed, most of the images from the first two categories are normally subject to copyright, and you can’t simply pluck the image and use it on your own blog or website.

So what if you run a blog and are looking for free images for your website?

There’s a third category of image search engines that only search for free photos on the web. These photos are licensed under creative commons and are pulled in from several stock photo sites.

It is important to note that the big search engines like Google, Bing, and Yahoo also allow users to search for free images via their “license” filter. By setting the license to Creative Commons, you can find free images on all three search sites.

Here are some other useful search engines for finding Creative Commons licensed images:

EveryPixel

EveryPixel indexes 51 paid and free stock image sites including Shutterstock, Pixabay, Unsplash and lots of others. Searchers can filter images by source, orientation, color and image type.

Librestock

Librestock allows you to “search the best 47 free stock photo websites in one place”. Unlike the first two sites, Librestock indexes only images licensed under the Creative Commons Zero (CC0), i.e. public domain images, which means you can use the photos freely without attribution for any legal purpose.

The downside is that there aren’t many pictures available, and there are no filters.

Creative Commons (CC) Search

CC Search is not a search engine in its own right, as is clearly stated on the site, but rather an interface that allows users to search several free photo sites without leaving the CC search page. Image sources include Flickr, Pixabay, Google Images and Wikimedia Commons. The site also includes options for finding media such as sound and video.

Conclusion: Which is the best search engine for images?

Search engines make life easier and come in handy for image search. So which is the best search engine for running image searches?

There’s really no single “best” search engine; each search engine has its perks and downsides depending on which type of search you’re carrying out. Google is a versatile option, combining a powerful general and reverse image search in one.

However, with its attractive visual interface and easy-to-find filtering options, Bing is a strong contender for general image searches, while TinEye offers more fine-tuning and often better suggestions than Google’s reverse image search.

Google, Bing and Yahoo all have options for searching by Creative Commons-licensed images, with Yahoo having the advantage of integration with Flickr, but a dedicated stock image search engine like EveryPixel will give you a wider choice of suitable images.

Ultimately, there are a lot of great tools out there for finding images depending on your needs, and by using them in combination, you can track down the perfect image.

Which image search engines do you use?



Source

Online Research Tools and Investigative Techniques

Tools & Techniques
By Paul Myers | May 5, 2015

Editor’s Note: The Verification Handbook for Investigative Reporting is a new guide to online search and research techniques for using user-generated content and open source information in investigations. Published by the European Journalism Centre, a GIJN member based in the Netherlands, the manual consists of ten chapters and is available for free download.

We’re pleased to reprint below chapter 3, by Internet research specialist Paul Myers.  For a comprehensive look at online research tools, see Myers’ Research & Investigative Links.

Search engines are an intrinsic part of the array of commonly used “open source” research tools. Together with social media, domain name look-ups, and more traditional solutions such as newspapers and telephone directories, effective web searching will help you find vital information to support your investigation.

Many people find that search engines often bring up disappointing results from dubious sources. A few tricks, however, can ensure that you corner the pages you are looking for, from sites you can trust. The same goes for searching social networks and other sources to locate people: A bit of strategy and an understanding of how to extract what you need will improve results.

This chapter focuses on three areas of online investigation:

  1. Effective web searching.
  2. Finding people online.
  3. Identifying domain ownership.

1. Effective web searching

Search engines like Google don’t actually know what web pages are about. They do, however, know the words that are on the pages. So to get a search engine to behave itself, you need to work out which words are on your target pages.

First off, choose your search terms wisely. Each word you add to the search focuses the results by eliminating results that don’t include your chosen keywords.

Some words are on every page you are after. Other words might or might not be on the target page. Try to avoid those subjective keywords, as they can eliminate useful pages from the results.

Use advanced search syntax.

Most search engines have useful so-called hidden features that are essential to helping focus your search and improve results.

Optional keywords

If you don’t have definite keywords, you can still build in other possible keywords without damaging the results. For example, pages discussing heroin use in Texas might not include the word “Texas”; they may just mention the names of different cities. You can build these into your search as optional keywords by separating them with the word OR (in capital letters), for example: heroin use Houston OR Dallas OR Austin.

You can use the same technique to search for different spellings of the name of an individual, company or organization.

Search by domain

You can focus your search on a particular site by using the search syntax “site:” followed by the domain name.

For example, to restrict your search to results from Twitter, you could search for something like: heroin use site:twitter.com

To add Facebook to the search, simply use “OR” again: heroin use site:twitter.com OR site:facebook.com

You can use this technique to focus on a particular company’s website, for example. Google will then return results only from that site.

You can use it to focus your search on municipal and academic sources, too. This is particularly effective when researching countries that use unique domain types for government and university sites.

Note: When searching academic websites, be sure to check whether the page you find is written or maintained by the university, one of its professors or one of the students. As always, the specific source matters.

Searching for file types

Some information comes in certain types of file formats. For instance, statistics, figures and data often appear in Excel spreadsheets. Professionally produced reports can often be found in PDF documents. You can specify a format in your search by using “filetype:” followed by the desired data file extension (xls for spreadsheets, docx for Word documents, etc.), for example: drug seizures filetype:xls

2. Finding people

Groups can be easy to find online, but it’s often trickier to find an individual person. Start by building a dossier on the person you’re trying to locate or learn more about. This can include the following:

  • The person’s name, bearing in mind:
    • Different variations (does James call himself “James,” “Jim,” “Jimmy” or “Jamie”?).
    • The spelling of foreign names in Roman letters (is Yusef spelled “Yousef” or “Yusuf”?).
    • Did the names change when a person married?
    • Do you know a middle name or initial?
  • The town the person lives in and or was born in.
  • The person’s job and company.
  • Their friends and family members’ names, as these may appear in friends and follower lists.
  • The person’s phone number, which is now searchable in Facebook and may appear on web pages found in Google searches.
  • Any of the person’s usernames, as these are often constant across various social networks.
  • The person’s email address, as these may be entered into Facebook to reveal linked accounts. If you don’t know an email address, but have an idea of the domain the person uses, sites such as email-format can help you guess it.
  • A photograph, as this can help you find the right person, if the name is common.

Advanced social media searches: Facebook

Facebook’s newly launched search tool is amazing. Unlike previous Facebook searches, it will let you find people by different criteria including, for the first time, the pages someone has Liked. It also enables you to perform keyword searches on Facebook pages.

This keyword search, the most recent feature, sadly does not incorporate any advanced search filters (yet). It also seems to restrict its search to posts from your social circle, their favorite pages and from some high-profile accounts.

Aside from keywords in posts, the search can be directed at people, pages, photos, events, places, groups and apps. The search results for each are available in clickable tabs.

For example, a simple search for Chelsea will bring up related pages and posts in the Posts tab:

The People tab brings up people named Chelsea. As with the other tabs, the order of results is weighted in favor of connections to your friends and favorite pages.

The Photos tab will bring up photos posted publicly, or posted by friends that are related to the word Chelsea (such as Chelsea Clinton, Chelsea Football Club or your friends on a night out in the Chelsea district of London).

The real investigative value of Facebook’s search becomes apparent when you start focusing a search on what you really want.

For example, if you are investigating links between extremist groups and football, you might want to search for people who like The English Defence League and Chelsea Football Club. To reveal the results, remember to click on the “People” tab.

This search tool is new and Facebook are still ironing out the creases, so you may need a few attempts at wording your search. That said, it is worth your patience.

Facebook also allows you to add all sorts of modifiers and filters to your search. For example, you can specify marital status, sexuality, religion, political views, pages people like, groups they have joined and areas they live or grew up in. You can specify where they studied, what job they do and which company they work for. You can even find the comments that someone has added to uploaded photos. You can find someone by name or find photos someone has been tagged in. You can list people who have participated in events and visited named locations. Moreover, you can combine all these factors into elaborate, imaginative, sophisticated searches and find results you never knew possible. That said, you may find still better results searching the site via search engines like Google (add “site:facebook.com” to the search box).

Advanced social media searches: Twitter

Many of the other social networks allow advanced searches that often go far beyond the simple “keyword on page” search offered by sites such as Google. Twitter’s advanced search, for example, allows you to trace conversations between users and add a date range to your search.

Twitter allows third-party sites to use its data and create their own exciting searches.
Followerwonk, for example, lets you search Twitter bios and compare different users. Topsy has a great archive of tweets, along with other unique functionality.

Advanced social media searches: LinkedIn

LinkedIn will let you search various fields including location, university attended, current company, past company or seniority.

You have to log in to LinkedIn in order to use the advanced search, so remember to check your privacy settings. You wouldn’t want to leave traceable footprints on the profile of someone you are investigating!

You can get into LinkedIn’s advanced search by clicking on the link next to the search box. Be sure, also, to select “3rd + Everyone Else” under relationship. Otherwise, your search will include your friends and colleagues and their friends.

LinkedIn was primarily designed for business networking. Its advanced search seems to have been designed primarily for recruiters, but it is still very useful for investigators and journalists. Personal data exists in clearly defined subject fields, so it is easy to specify each element of your search.

You can enter normal keywords, first and last names, locations, current and previous employers, universities and other factors. Subscribers to their premium service can specify company size and job role.

Other options

Sites like Geofeedia and Echosec allow you to find tweets, Facebook posts, YouTube videos, Flickr and Instagram photos that were sent from defined locations. Draw a box over a region or a building and reveal the social media activity.  Geosocialfootprint.com will plot a Twitter user’s activity onto a map (all assuming the users have enabled location for their accounts).

Additionally, specialist “people research” tools like Pipl and Spokeo can do a lot of the hard legwork for your investigation by searching for the subject on multiple databases, social networks and even dating websites. Just enter a name, email address or username and let the search do the rest. Another option is to use the multisearch tool from Storyful. It’s a browser plugin for Chrome that enables you to enter a single search term, such as a username, and get results from Twitter, Instagram, YouTube, Tumblr and Spokeo. Each site opens in a new browser tab with the relevant results.

Searching by profile pic

People often use the same photo as a profile picture for different social networks. This being the case, a reverse image search on sites like TinEye and Google Images will help you identify linked accounts.

3. Identifying domain ownership

Many journalists have been fooled by malicious websites. Since it’s easy for anyone to buy an unclaimed .com, .net or .org site, we should not go on face value. A site that looks well produced and has an authentic-sounding domain name may still be a political hoax, a false company or a satirical prank.

Some degree of quality control can be achieved by examining the domain name itself. Google it and see what other people are saying about the site. A “whois” search is also essential. DomainTools.com is one of many sites that offer the ability to perform a whois search. It will bring up the registration details given by the site owner when the domain name was purchased.

For example, the World Trade Organization was preceded by the General Agreement on Tariffs and Trades (GATT). There are, apparently, two sites representing the WTO. There’s wto.org (genuine) and gatt.org (a hoax). A mere look at the site hosted at gatt.org should tell most researchers that something is wrong, but journalists have been fooled before.

A whois search dispels any doubt by revealing the domain name registration information. Wto.org is registered to the International Computing Centre of the United Nations. Gatt.org, however, is registered to “Andy Bichlbaum” from the notorious pranksters the Yes Men.
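
If you prefer to script such look-ups, the same record can be pulled from the command line. Below is a minimal sketch that simply shells out to the standard whois command-line client (assumed to be installed on the system); the helper name whois_lookup is just illustrative:

    # Minimal sketch: retrieve raw whois records by calling the standard
    # command-line whois client.
    import subprocess

    def whois_lookup(domain):
        """Return the raw whois record for a domain as text."""
        result = subprocess.run(["whois", domain], capture_output=True, text=True, check=True)
        return result.stdout

    for domain in ("wto.org", "gatt.org"):
        print("---", domain, "---")
        print(whois_lookup(domain))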

Whois is not a panacea for verification. People can often get away with lying on a domain registration form. Some people will use an anonymizing service like Domains by Proxy, but combining a whois search with other domain name and IP address tools forms a valuable weapon in the battle to provide useful material from authentic sources.

To know more, check out the tipsheets about online research and verification, prepared by Raymond Joseph for the African Investigative Journalism Conference 2016.

Paul Myers is a BBC Internet research specialist. He also runs The Internet Research Clinic, a website dedicated to directing journalists to the best research links, apps, and resources. At the BBC Academy, he runs training courses that include online investigation, data journalism, social media, statistics, and web design. Paul has also helped train personnel from The Guardian, the Daily Telegraph, the Times, Channel 4, CNN, the World Bank and the UNDP.

Source

10 Top Best Free & Open Source Social Network Platforms To start Your Own

When we talk about social media websites, the only names that come to mind are the best-known ones, such as Facebook, Twitter, and a few others. In today’s world, a social network is more than just a chatting platform; it is now a source of knowledge and awareness. Before developing any social network site you need a deep knowledge of PHP, MySQL, and Linux. But even with knowledge of these coding languages, developing a social media platform is a long, slow and time-consuming task, and nobody can guarantee its success. So, how do you make a social networking site? To build your own social networking website you need some tools, and open source social network development platforms are one of those. They come with pre-built tools that are flexible and help you easily customize and build your own network on top of them.

There are a couple of good online platforms that allow you to create social networking sites, but rather than using hosted platforms, try self-hosted social network software to get more control over your social networking website.

There are plenty of paid and free scripts for creating a social network, but if you are looking only for free and open source options, please see the list below.

 

Best open-source Software platforms

Elgg

Elgg is open source social network software which is free to download. It is built on a framework that allows you to create any kind of social environment: whether you want to start a social network for a school or college, or build communities for an organization, you can use Elgg. It is a 2008 award-winning open source social networking engine. Elgg runs on the Apache, PHP, MySQL and Linux stack and has a good community to help solve issues, along with a repository of 1000+ open source plugins.

Elgg features :

  • Well-documented core API for developers to easily start and learn
  • Composer to make the installation of Elgg easy and simple, and to maintain the Elgg core and plugins
  • A flexible system of hooks that allows extension and modification of the application with the help of plugins and custom themes
  • Cacheable system for good performance, user authentication, and a built-in security system with anti-CSRF validation, strict XSS filters and HMAC signatures
  • Client-side API
  • Content access policies
  • File storage
  • Notifications service
  • RPC web services
  • And More…

 

Dolphin social networking software

Dolphin Pro is open source software for creating custom social networks and web communities. It is written in PHP and uses MySQL as its database. This social networking website platform is fully modular and offers multiple modules such as Ads, Payments, Photos, Polls, Profile Customizer, Profiler, Chat, Desktop, Facebook Connect, Forums, Videos, Memberships, Messenger, Page Access Control, World Map, Events, Custom RSS, SMTP Mailer, Sounds and more. It also features social profiles, timelines, likes, shares, voting, friends, Chat+ (WebRTC multiuser audio/video chat) and comments.

Dolphin is available in three editions: Free, Monthly ($29/month) and Permanent ($599/one-time). In the free version, your social network displays a “powered by Dolphin” badge.

Open Source Social Network (OSSN)

OSSN is another of the best open source social network software packages, with a somewhat Facebook-like interface and features such as messaging, a friend request panel and a few other elements. It lets you create a full-featured social media platform that supports groups, photos, files, messages and more. OSSN is multilanguage social network software, and you can add as many languages as you want. It is available in two versions, basic and premium; furthermore, users can download it as an installer (Linux) or as a virtual image.

Open Source Social Network also features third-party integrations, tools, themes, games, audio/video calls, authentication (Google reCAPTCHA) and more.

HumHub

HumHub is a free and open source social network software kit and framework with a user-friendly, Facebook-like interface. It is lightweight and features multiple tools that make communication and collaboration easy. HumHub can be customized to build your own social network, social intranet or large social enterprise application.

HumHub is a flexible system with a modular design that can be extended using third-party tools, connected to existing software, or extended with modules you write yourself. As a self-hosted solution it gives you full control over your social network: your server, your data, your rules. Community and enterprise editions are available.

HumHub Social Network Software Features 

  • Notifications
  • Activity Stream
  • Social tools
  • Files
  • Directory
  • Groups
  • User profiles
  • Share content with non-registered users
  • Search files and people
  • Mobile Ready
  • And More

Oxwall


Oxwall is a free social network software platform and content management system. It is based on PHP and uses MySQL as its database for deploying a social network environment. It is available in three editions: Free, Starter solution ($249) and Advanced solution ($2,999). The free edition includes the Oxwall software, access to the developer forum, access to third-party plugins, and access to the documentation. The CMS is compatible with all types of websites and is scalable too.

 Oxwall social network CMS (content management system) Features:

  • Facebook Connect to Login easily
  • Facebook-style friend system
  • chat and more
  • Google Analytics
  • Facebook-like newsfeed
  • Video embeds from Youtube, Vimeo, Dailymotion, etc.
  • Social media sharing
  • Activity notifications
  • User blogs
  • Contact importer to invite friends
  • Groups
  • Photo sharing
  • Create online and offline events
  • Display users’ birthdays, like Facebook
  • Privacy control
  • Image slideshow
  • Cloudflare integration
  • And More…

BuddyPress

BuddyPress is a product of the well-known content management system WordPress and helps you create social networking websites with WordPress. It is simple to use, and tons of themes are available online that let you easily customize the look and feel of your social network. BuddyPress is based on PHP and can be customized easily if you have coding knowledge. It is a completely free and open source social network development platform.

The BuddyPress social content management system features custom profile fields, personal profiles, email notifications (smart read/unread), micro-communities created by your users, plugin and extension support, private messaging, friendship connections, a platform for discussions and much more.

Other available open source and free social network software projects

Apart from the best platforms above, here are a few other free software options available online for creating a social network and collaborating.

pH7 Social Dating Software

pH7CMS is for people interested in building social dating websites. It is a totally open source, enterprise-class social dating web app builder that allows developers to launch dating sites similar to Tinder or Badoo. As it is open source, anyone with PHP coding knowledge can easily customize it to fulfil custom social network requirements.

Jcow

Jcow is a social networking script written in PHP that helps you build your own niche social network and online community. It has a Facebook-like interface.

Jamroom – Self Hosted

Jamroom Open Source can be hosted on your own servers. It is also available in paid Premium and Professional editions with extra features.

eXo Tribe

eXo Tribe is a free eXo-based online collaboration platform dedicated to customer communities.

Peepso

PeepSo is a plugin that adds social networking capabilities to WordPress-based websites. It features friends, targeted ads, photos, extended profiles, groups, blog posts, chat, and reactions.

AstroSPACES

AstroSPACES is free and open source social network software coded from scratch; it is web-based and written in PHP.

Insoshi social software

Insoshi is a social networking platform developed in Ruby on Rails. It is free software that can be used to create custom social networks. Both the packaged releases and the source code of Insoshi are available on GitHub.

Friendica

Friendica is used to create a distributed social network. It is free software developed by many people around the world. It features status updates, photos, albums, tagging, an events calendar, privacy with military-grade encryption, relationship control, network browsing filters, themes and plugins, and much more.

AROUNDMe

AROUNDMe allows you to create multiple collaborative group, web space, community or social networking websites. It features tools such as a blog, forum, wiki and guestbook, and is completely customizable with XHTML, PHP, Java, and CSS. Groups in AROUNDMe can be private or public.

Anahita

Anahita is another open source social networking platform and framework for developing open science and knowledge-sharing applications.

CommunityEngine

CommunityEngine is a free, open-source social network plugin for Ruby on Rails applications. User profiles, Blogs, Private messaging, Events, and Forums are some of its core features.

Mahara

Mahara is an open source social networking web application for building your electronic portfolio. You can create journals, upload files, embed third-party content and collaborate with other users in groups.

Pump.io

It’s a stream server for social media networks.


If you know of any other open source social network platform that works best for you, please help us grow this list. The comment box is all yours…

 

 

 


8 Pros and Cons of Open Source Software

In 1998, a new approach to software development came to be: Open Source Software (OSS), or “free software”. Although it is not always literally free of cost, an open-source program has its source code available for other users to use, modify, build upon and then distribute as their own versions. Moreover, virtually anyone can use the program for whatever purpose they see fit, and there are no licensing fees or restrictions.

What Is a Source Code?

Source code is what computer programmers work with to enhance or modify software applications and change the way they work. They also use it to fix errors in the software. Ordinary computer users never see the source code of the software they run.

The open source software community has grown over the years, and today open source has become a multi-billion-dollar industry that supporters and critics alike see as having both advantages and disadvantages. Let’s take a look at the benefits and setbacks of this controversial movement.

List of Pros of Open Source Software

1. Good for Businesses.
Software experts who support open source point out that businesses, non-profit organizations and even government agencies are already adopting OSS for a variety of purposes. These institutions have accepted the concept of the movement and its relevance to developing quality software, which has helped businesses build a reputation and take advantage of the technology. Today, content management systems such as Joomla, WordPress and Plone are widely used by organizations, and industries can build better versions on top of open source software.

2. Easy to Download.
Supporters of open source software say that it has given users access to software from the internet without having to pay for it. Take, for example, free software for downloading music: songs can be downloaded without going to iTunes and paying for music downloads. Today it is also possible to install an operating system without paying for a licensed one; Ubuntu is available for download and can be a substitute for Windows. Although this requires some technical skill, it can still mean big savings for the average computer user who does not have the money to buy proprietary software.

3. Innovation Central.
With the freedom to modify and edit open source code, users are not restricted by the licensing the way they are with closed-source software such as Google Docs, which does not allow people to do that. The open license of OSS allows people to make better versions of an application and share them with others, who in turn can also modify and improve them. In the end, the whole software community benefits from this movement.

4. Informative.
Advocates for open source software claim that the movement has allowed individuals to learn and improve their computer programming skills. Those who are starting to learn about Search Engine Optimization (SEO) and web programming can study open source code to learn more about programming. In fact, a lot of companies have their in-house web developers copy and modify source code from free WordPress themes to get ideas for client websites. By adding to and modifying that code, new website designs are easier to create.

List of Cons of Open Source Software

1. Not Really Free.
Critics of open source software posit that it only gives you the freedom to modify the code; in reality, it is not totally free. Because the budget for creating open source software is usually far smaller than what is spent on proprietary software, they argue it ends up inferior to commercial software: less effort and detail go into its documentation, usability and, more importantly, its development. Consequently, a person who uses it needs to invest more time and effort in installation and improvement.

2. No Guaranteed Support.
Critics of OSS also point out the lack of guaranteed technical support. Take the case of open source testing tools. Although several open source communities are willing to respond to inquiries and add new features more often than some vendors do, this does not guarantee that these communities will be supportive and responsive at all times, simply because they are not being paid to do so. If users expect certain features to be added, they have to wait for the community to build them, with no guaranteed time frame. It is true that open source software has some support available, but not all applications are fully supported, and unfortunately some developers stop supporting their applications altogether.

3. Not Profitable.
People who are not fans of the open source movement stress that making money from your own version of an open source program is hardly possible, to say the least. This is because the license itself allows anyone to copy, modify and distribute it, so others can simply make their own versions without paying the developer.

4. Has Flaws.
Critics of OSS say that although some initiatives have been successful and continue to thrive, others have failed, such as Eazel and SourceXchange. For unconvinced experts, OSS does not have what it takes to produce quality systems: its development process is vague, there is little empirical evidence behind its claims, and in some cases it takes a long time to identify defects. It can also give hackers the opportunity to easily study the software's weaknesses. Given these factors, they doubt the benefits of OSS.

Since the birth of the open source software movement, many users have reaped its rewards but according to those who are skeptical about it, OSS is not fool-proof. But come to think of it, it has been more of an advantage than a setback. Moreover, it is only an option since commercial and closed source software are still available for users.

-Flow Psychology Editor


Top 11 Open Source Databases for Your Next Project

Data is everything. And by extension, so are databases. Here are some fantastic open source options for your next kick-ass project.

For a world dominated so long by database suites like Oracle and SQL Server, there seems to be an endless flurry of solutions now. One part of the reason is innovation fueled by Open Source — really talented developers wanting to scratch an itch and creating something that they can revel in.

The other part is the emergence of new business models, wherein businesses maintain a community version of their product to gain mind share and traction, while also providing a commercial, add-on offering.

The result?

More databases than one can keep up with. There’s no official stat on this, but I’m pretty sure we have over a hundred options available today if you combine everything from stack-specific object databases to not-so-popular projects from universities.

I know, it frightens me, too. Too many options — too much documentation to go through — and a life that is so short.

That’s why I decided to write this article, presenting eleven of the best databases you can use to improve your solutions, whether building for yourself or others.

No MySQL

Please note: this list isn’t going to contain MySQL, even though it’s arguably the most popular Open Source database solution out there.

Why? Simply because MySQL is everywhere — it’s what everyone learns first, it’s supported by virtually every CMS or framework out there, and it’s very, very good for most use cases. In other words, MySQL doesn’t need to be “discovered.”

That said, please note that the following aren’t necessarily alternatives to MySQL. In some cases, they might be, while in others they’re a completely different solution for an entirely different need. Don’t worry, as I’ll be discussing their uses also.

Special note: compatibility

Before we begin, I also must mention that compatibility is something you need to keep in mind. If you have a project that, for whatever reason, supports only a particular database engine, your choices are pretty much made for you.

For instance, if you’re running WordPress, this article is of no use to you. Similarly, those running static sites on JAMStack will gain nothing by looking for alternatives too seriously.

It’s up to you to figure out the compatibility equation. However, if you do have a blank slate and the architecture is up to you, here are some neat recommendations.

Open Source Databases

PostgreSQL

If you’re from the PHP land (WordPress, Magento, Drupal, etc.), then PostgreSQL will sound foreign to you. However, this relational database solution has been around since 1997 and is the top choice in communities like Ruby, Python, Go, etc.

In fact, many developers eventually “graduate” to PostgreSQL for the features it offers, or simply for the stability. It’s hard to convince someone in a short write-up like this but think of PostgreSQL as a thoughtfully-engineered product that never lets you down.

There are many good SQL clients available to connect to PostgreSQL database for administration and development.

Unique Features

PostgreSQL has several fascinating features as compared to other relational databases (specifically, MySQL), such as:

  • Built-in data types for Array, Range, UUID, Geolocation, etc.
  • Native support for document storage (JSON-style), XML, and key-value storage (Hstore)
  • Synchronous and asynchronous replication
  • Scriptable in PL/pgSQL, PL/Perl, PL/Python and more
  • Full-text search

My personal favorites are the geolocation engine (which takes away the pain when working with location-based apps — try finding all nearby points manually, and you’ll know what I mean) and support for arrays (many MySQL projects are undone for want of arrays, opting instead for the infamous comma-separated strings).
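
As a small illustration of the array and document support mentioned above, here is a minimal sketch using the psycopg2 driver against a local PostgreSQL instance; the "places" table and its columns are hypothetical.

```python
# Minimal sketch: PostgreSQL's native array and JSONB types via psycopg2.
# Assumes a local PostgreSQL instance; the "places" table is hypothetical.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=demo user=demo password=demo host=localhost")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS places (
        name     TEXT,
        tags     TEXT[],   -- native array column
        metadata JSONB     -- native document column
    )
""")
cur.execute(
    "INSERT INTO places (name, tags, metadata) VALUES (%s, %s, %s)",
    ("City Museum", ["museum", "indoor"], Json({"open": True, "category": "culture"})),
)
conn.commit()

# Array membership test and a JSONB containment query in plain SQL.
cur.execute(
    "SELECT name, metadata->>'category' FROM places "
    "WHERE 'museum' = ANY(tags) AND metadata @> %s::jsonb",
    ('{"open": true}',),
)
print(cur.fetchall())

cur.close()
conn.close()
```

This is what makes the hybrid relational/document model practical: the same table mixes strict columns with a schemaless JSONB field, and both can be filtered in one query.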

When to use PostgreSQL

PostgreSQL is always a better choice over any other relational database engine. That is, if you’re starting a new project and have been bitten by MySQL before, it’s a good time to consider PostgreSQL. I have friends who gave up battling MySQL’s mysterious transactional lock failures and moved on permanently. If you decide the same, you won’t be overreacting.

PostgreSQL also has a clear advantage if you need partial NoSQL facilities for a hybrid data model. Since document and key-value storage are natively supported, you don’t need to go hunting for, installing, learning, and maintaining another database solution.

When not to use PostgreSQL

PostgreSQL doesn’t make sense when your data model isn’t relational and/or when you have very specific architectural requirements. For instance, consider Analytics, where new reports are constantly being created from existing data. Such systems are read-heavy and suffer when a strict schema is imposed on them. Sure, PostgreSQL has a document storage engine, but things start to fall apart when you’re dealing with large datasets.

In other words, always use PostgreSQL, unless you know 100% what you’re doing!

Check out this SQL & PostgreSQL for Beginners course if interested in learning more.

MariaDB

MariaDB was created as a replacement for MySQL, by the same person who developed MySQL.

Confused?

Well, actually, after MySQL was taken over by Oracle in 2010 (by acquiring Sun Microsystems, which, incidentally, is also how Oracle came to control Java), the creator of MySQL started a new open source project called MariaDB.

Why does all this boring detail matter, you ask? It’s because MariaDB was created from the same code base as that of MySQL (in the open source world, this is known as “forking” an existing project). As a result, MariaDB is presented as a “drop-in” replacement for MySQL.

That is, if you’re using MySQL and want to migrate to MariaDB, the process is so easy that you just won’t believe it.

Unfortunately, such a migration is a one-way street. Going back from MariaDB to MySQL is not possible, and should you try to force it, permanent database corruption is all but guaranteed!

Unique features

While MariaDB began as essentially a clone of MySQL, that is no longer strictly true. Ever since the fork, the differences between the two have been growing. As of this writing, adopting MariaDB needs to be a well-thought-through decision on your part. That said, there are plenty of new things going on in MariaDB that may help you make this transition:

  • Truly free and open: Since there’s no single corporate entity controlling MariaDB, you can be free of sudden predatory licensing and other worries.
  • Several additional storage engines for specialized needs: for instance, the Spider engine for distributed transactions and the ColumnStore engine for massive, parallel, distributed data warehousing, among many others.
  • Speed improvements over MySQL, especially due to the Aria storage engine for complex queries.
  • Dynamic columns for different rows in a table.
  • Better replication capabilities (for example, multi-source replication)
  • Several JSON functions
  • Virtual columns

. . . And many, many more. It’s exhausting to keep up with all the MariaDB features.
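
To make one of the bullet points above concrete, here is a hedged sketch of MariaDB's dynamic columns (per-row "columns" packed into a blob), written against MariaDB Connector/Python; the connection details and the "products" table are hypothetical.

```python
# Minimal sketch of MariaDB dynamic columns: each row can carry its own
# set of attributes inside a single blob column.
# Assumes a local MariaDB server and MariaDB Connector/Python;
# the "shop" database and "products" table are hypothetical.
import mariadb

conn = mariadb.connect(host="localhost", user="demo", password="demo", database="shop")
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS products (id INT PRIMARY KEY, attrs BLOB)")

# Pack arbitrary key/value attributes into the dynamic-column blob.
cur.execute(
    "INSERT INTO products VALUES (1, COLUMN_CREATE('color', 'red', 'size', 'M'))"
)
# Pull a single attribute back out, typed as CHAR.
cur.execute("SELECT COLUMN_GET(attrs, 'color' AS CHAR) FROM products WHERE id = 1")
print(cur.fetchone())

conn.commit()
conn.close()
```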

When to use MariaDB

You should use MariaDB if you want a true replacement for MySQL, want to stay on the innovation curve, and don’t plan on returning to MySQL. One excellent use case is using MariaDB's new storage engines to complement the existing relational data model of your project.

When not to use MariaDB

Compatibility with MySQL is the only concern here. That said, it’s becoming less of a problem as projects like WordPress, Joomla, Magento, etc., have started supporting MariaDB. My advice would be not to use MariaDB to trick a CMS that doesn’t support it, as there are many database-specific tricks that will crash the system easily.

CockroachDB

The team behind CockroachDB seems to be composed of masochists. With a product name like that, surely they want to turn all odds against them and still win?

Well, not quite.

The idea behind “cockroach” is that it’s an insect built for survival. No matter what happens — predators, floods, eternal darkness, rotting food, bombing, the cockroach finds a way to survive and multiply.

The point is that the team behind CockroachDB (composed of former Google engineers) was frustrated with the limitations of traditional SQL solutions when it comes to large scale. That’s because, historically, SQL solutions were supposed to be hosted on a single machine (data wasn’t that big). For a long time, there was no way to build a cluster of databases running SQL, which is why MongoDB captured so much attention.

Even when replication and clustering came out in MySQL, PostgreSQL, and MariaDB, it was painful at best. CockroachDB wants to change that, bringing effortless sharding, clustering, and high availability to the world of SQL.

When to use CockroachDB

CockroachDB is the system architect’s dream come true. If you swear by SQL and have been eyeing the scaling capabilities of MongoDB with envy, you’ll love CockroachDB. Now you can quickly set up a cluster, throw queries at it, and sleep peacefully at night.
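
Getting started is deliberately familiar: CockroachDB speaks the PostgreSQL wire protocol, so an ordinary PostgreSQL driver can talk to it. A minimal sketch, assuming a local single-node cluster started in insecure mode on the default SQL port (26257) and the psycopg2 driver; the "accounts" table is hypothetical.

```python
# Minimal sketch: connecting to CockroachDB through a standard PostgreSQL driver.
# Assumes a local single-node cluster in insecure mode; the table is hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=26257, user="root", dbname="defaultdb", sslmode="disable"
)
conn.autocommit = True
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS accounts (id SERIAL PRIMARY KEY, balance INT)")
cur.execute("INSERT INTO accounts (balance) VALUES (%s)", (1000,))
cur.execute("SELECT id, balance FROM accounts")
print(cur.fetchall())

cur.close()
conn.close()
```

The application code looks like plain PostgreSQL; the clustering and replication happen underneath.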

When not to use CockroachDB

Better the devil you know than the one you don’t. By that I mean, if your existing RDBMS is working well for you and you think you can manage the scaling pains it brings, stick with it. For all the genius involved, CockroachDB is a new product, and you don’t want to be struggling against it later on. Another major reason is SQL compatibility — if you’re doing exotic SQL stuff and rely on it for critical things, CockroachDB will present too many edge cases for your liking.

From now on, we’ll consider non-SQL (or NoSQL, as it’s called) database solutions for highly specialized needs.

Neo4j

One of the most significant developments in the recent decade is connected data. The world around us is not partitioned into tables and rows and boxes — it’s one giant mess with everything connected to almost everything else.

Social networks are a prime example, and building a similar data model using SQL or even document-based databases is a nightmare.

That’s because the ideal data structure for these solutions is the graph, which is an entirely different beast. And for that, you need a graph database like Neo4j.

An example on the Neo4j website shows how university students are connected to their departments and courses. Such a data model is all but impossible with SQL, as it would be tough to avoid infinite loops and memory overruns.

Unique features

Graph databases are unique in themselves, and Neo4j is pretty much the only option for working with graphs. As a result, whatever features it has are unique.

  • Support for transactional applications and graph analytics.
  • Data transformation abilities for digesting large-scale tabular data into graphs.
  • Specialized query language (Cypher) for querying the graph database
  • Visualization and discovery features

It’s a moot point to discuss when to use Neo4j, and when not. If you need graph-based relationships between your data, you need Neo4j.
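
To give a flavor of what graph queries look like, here is a minimal sketch using the official neo4j Python driver and the Cypher query language; the connection details, credentials and the Student/Course model are hypothetical.

```python
# Minimal sketch of creating and traversing a graph with Cypher.
# Assumes a local Neo4j instance; credentials and the Student/Course
# labels are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    # Create a student enrolled in a course (MERGE avoids duplicates).
    session.run(
        "MERGE (s:Student {name: $name}) "
        "MERGE (c:Course {title: $title}) "
        "MERGE (s)-[:ENROLLED_IN]->(c)",
        name="Alice", title="Databases 101",
    )
    # Traverse the relationship back out.
    result = session.run(
        "MATCH (s:Student)-[:ENROLLED_IN]->(c:Course) "
        "RETURN s.name AS student, c.title AS course"
    )
    for record in result:
        print(record["student"], "->", record["course"])

driver.close()
```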

MongoDB

MongoDB was the first non-relational database to make big waves in the tech industry and continues to dominate a fair share of attention.

Unlike relational databases, MongoDB is a “document database,” which means it stores data in chunks, with related data clumped together in the same chunk. This is best understood by imagining a collection of JSON-style documents, each of which bundles everything about a user in one place.
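
A minimal sketch of such a document, written against the pymongo driver; the collection and field names are hypothetical.

```python
# Minimal sketch of MongoDB's document model using pymongo.
# Assumes a local MongoDB instance; the "users" collection and its
# field names are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["demo"]

# One document bundles the user's contact details and access levels together.
db.users.insert_one({
    "name": "Alice",
    "contact": {"email": "alice@example.com", "phone": "555-0100"},
    "access": {"role": "editor", "level": 3},
})

# Fetching the user brings back the nested data in one go, with no joins.
user = db.users.find_one({"name": "Alice"})
print(user["contact"]["email"], user["access"]["role"])
```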

Here, unlike a table-based structure, the contact details and access levels of a user reside inside the same object. Fetching the user object fetches the associated data automatically, and there’s no concept of a join. Here’s a more detailed intro to MongoDB.

Unique features

MongoDB has some serious (I almost want to write “kick-ass” to convey the impact, but it wouldn’t be proper on a public website, perhaps) features that have made several seasoned architects abandon the relational land forever:

  • A flexible schema for specialized/unpredictable use cases.
  • Ridiculously simple sharding and clustering. You just need to set up the configuration for a cluster and forget about it.
  • Adding or removing a node from a cluster is drop-dead simple.
  • Distributed transactional locks. This feature was missing in the earlier versions but was eventually introduced.
  • Optimized for very fast writes, making it highly suitable for analytics data or for use as a caching layer.

If I sound like a spokesperson for MongoDB, I apologize, but it’s hard to oversell the advantages of MongoDB. Sure, NoSQL data modeling is weird at first, and some never get the hang of it, but for many architects, it almost always wins out over a table-based schema.

When to use MongoDB

MongoDB is a great crossover bridge from the structured, strict world of SQL to the amorphous, almost confusing one of NoSQL. It excels at developing prototypes, as there’s simply no schema to worry about, and when you really need to scale. Yes, you can use a cloud SQL service to get rid of DB scaling issues, but boy is it expensive!

Finally, there are use cases where SQL-based solutions just won’t do. For instance, if you’re creating a product like Canva, where the user can create arbitrarily complex designs and be able to edit them later, good luck with a relational database!

When not to use MongoDB

The complete lack of schema that MongoDB provides can work as a tar pit for those who don’t know what they’re doing. Data mismatch, dead data, empty fields that should not be empty — all this and much more is possible. MongoDB is essentially a “dumb” data store, and if you choose it, the application code has to take a lot of responsibility for maintaining data integrity.


RethinkDB

As its name goes, RethinkDB “rethinks” the idea and capabilities of a database when it comes to real-time apps.

When a database gets updated, there’s no way for the application to know. The accepted approach is for the app to fire off a notification as soon as there’s an update, which gets pushed to the front-end through a complex bridge (PHP -> Redis -> Node -> Socket.io is one example).

But what if the updates could be pushed directly from the database to the front-end?!

Yes, that’s the promise of RethinkDB. So if you’re building a true real-time application (a game, a marketplace, an analytics dashboard, etc.), RethinkDB is worth a look.
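
The core idea is the changefeed: instead of polling, the application subscribes to a query and RethinkDB pushes every change to it. Here is a minimal sketch using the official rethinkdb Python driver, assuming a local instance; the "demo" database and "scores" table are hypothetical.

```python
# Minimal sketch of a RethinkDB changefeed with the official Python driver.
# Assumes a local RethinkDB instance and a pre-existing "scores" table;
# database and table names are hypothetical.
from rethinkdb import RethinkDB

r = RethinkDB()
conn = r.connect(host="localhost", port=28015, db="demo")

# Every insert/update/delete on the table is pushed to this cursor as it
# happens, so the app can forward it straight to connected clients.
for change in r.table("scores").changes().run(conn):
    print("old:", change["old_val"], "-> new:", change["new_val"])
```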

Redis

When it comes to databases, it’s almost too easy to overlook the existence of Redis. That’s because Redis is an in-memory database and is mostly used in support functions like caching.

Learning this database is a ten-minute job (literally!), and it’s a simple key-value store that stores strings with an expiry time (which can be set to infinity, of course). What Redis loses in features it makes up for in utility and performance. Since it lives entirely in RAM, reads and writes are insanely fast (a few hundred thousand operations per second aren’t unheard of).

Redis also has a sophisticated pub-sub system, which makes this “database” twice as attractive.
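
A minimal sketch of both ideas (expiring keys and pub/sub) using the redis-py client, assuming a local Redis server; the key and channel names are hypothetical.

```python
# Minimal sketch: Redis as a cache with expiring keys, plus simple pub/sub.
# Assumes a local Redis server; key and channel names are hypothetical.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Cache a value with a 60-second expiry.
r.set("session:42", "alice", ex=60)
print(r.get("session:42"))        # -> "alice" (until the key expires)

# Pub/sub: subscribe on one connection, publish on another.
p = r.pubsub()
p.subscribe("events")
r.publish("events", "cache invalidated")
for message in p.listen():
    if message["type"] == "message":
        print(message["data"])    # -> "cache invalidated"
        break
```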

In other words, if you have a project that could benefit from caching or has some distributed components, Redis is the first choice.

SQLite

Yes, I promised that we were done with relational databases, but SQLite is too cute to ignore.

SQLite is a lightweight C library that provides a relational database storage engine. Everything in this database lives in a single file (with a .sqlite extension) that you can put anywhere in your filesystem. And that’s all you need to use it! Yes, no “server” software to install, and no service to connect to.
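
Python even ships with SQLite support in its standard library, which makes the "no server, just a file" model easy to see. A minimal sketch; the file name and table are hypothetical.

```python
# Minimal sketch of SQLite's zero-setup model using Python's built-in sqlite3 module.
# The file name and table are hypothetical; no server process is involved.
import sqlite3

# The whole database lives in this single file (created on first use).
conn = sqlite3.connect("notes.sqlite")
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")
cur.execute("INSERT INTO notes (body) VALUES (?)", ("remember the milk",))
conn.commit()

for row in cur.execute("SELECT id, body FROM notes"):
    print(row)

conn.close()
```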

Useful features

Even though SQLite is a lightweight alternative to a database like MySQL, it packs quite a punch. Some of its shocking features are:

  • Full support for transactions, with COMMIT, ROLLBACK, and BEGIN.
  • Support for 32,000 columns per table
  • JSON support
  • 64-way JOIN support
  • Subqueries, full-text search, etc.
  • Maximum database size of 140 terabytes!
  • Maximum row size of 1 gigabyte!
  • Reading small blobs can be about 35% faster than direct file I/O (per the SQLite project's own benchmarks)

When to use SQLite

SQLite is an extremely specialized database that focuses on a no-nonsense, get-shit-done approach. If your app is relatively simple and you don’t want the hassle of a full-blown database, SQLite is a serious candidate. It makes particular sense for small- to mid-sized CMSs and demo applications.

When not to use SQLite

While impressive, SQLite doesn’t cover all the features of standard SQL or of your favorite database engine. Clustering, stored procedures, and scripting extensions are missing. There is also no client-server mode, so there is nothing to connect to over the network for querying and exploring the database. Finally, as the application size grows, performance will degrade.

Cassandra

While many proclaim that the end is near for Java, every once in a while the community drops a bombshell and silences the critics. Cassandra is one such example.

Cassandra belongs to what’s known as the “columnar” family of databases. The storage abstraction in Cassandra is a column rather than a row. The idea here is to store all the data in a column physically together on the disk, minimizing seek time.
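
Despite the different storage model, working with Cassandra feels familiar because its query language (CQL) looks a lot like SQL. A minimal sketch using the DataStax cassandra-driver for Python, assuming a local single-node cluster; the keyspace and table are hypothetical.

```python
# Minimal sketch of writing to Cassandra with CQL via the cassandra-driver.
# Assumes a local single-node cluster; the "metrics" keyspace and
# "sensor_readings" table are hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # contact point(s) of the cluster
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS metrics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS metrics.sensor_readings (
        sensor_id text, reading_time timestamp, value double,
        PRIMARY KEY (sensor_id, reading_time)
    )
""")
# Writes like this are what Cassandra is built to absorb at very high rates.
session.execute(
    "INSERT INTO metrics.sensor_readings (sensor_id, reading_time, value) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("sensor-1", 22.5),
)
cluster.shutdown()
```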

Unique features

Cassandra was designed with a specific use case in mind — dealing with write-heavy loads and zero tolerance for downtime. These become its unique selling points.

  • Extremely fast write performance. Cassandra is arguably the fastest database out there when it comes to handling heavy write loads.
  • Linear scalability. That is, you can keep adding as many nodes to a cluster as you want, with no increase in the complexity or brittleness of the cluster.
  • Unmatched partition tolerance. That is, even if multiple nodes in a Cassandra cluster go down, the database is designed to keep performing without loss of integrity.
  • Static typing

When to use Cassandra

Logging and analytics are two of the best use cases for Cassandra. But that’s not all — the sweet spot is when you need to handle really large sizes of data (Apple has a Cassandra deployment handling 400+ petabytes of data while at Netflix it handles 1 trillion requests a day) with literally zero downtime. High availability is one of the hallmarks of Cassandra.

When not to use Cassandra

The column storage scheme of Cassandra also has its disadvantages. The data model is rather flat, and if you need aggregations, then Cassandra falls short. Moreover, it achieves high availability by sacrificing consistency (remember the CAP theorem for distributed systems), which makes it less suitable for systems where high read accuracy is needed.

Timescale

New developments demand new types of databases, and the Internet of Things (IoT) is one such phenomenon. One of the best open source databases for that is Timescale.

Timescale is a type of what’s called a “time series” database. It differs from a traditional database in that time is the primary axis of concern, and the analytics and visualization of massive data sets is a top priority. Time series databases rarely see changes to existing data; an example is temperature readings sent by a sensor in a greenhouse: new data keeps accumulating every second, which is of interest for analytics and reporting.

Why not only use a traditional database with a timestamp field, then? Well, there are two main reasons for that:

  • General-purpose databases are not optimized to work with time-based data. For the same amounts of data, a general-purpose database will be much slower.
  • The database needs to handle massive amounts of data as new data keeps flowing in, and removing data or changing the schema later on is not an option.

Unique features

TimescaleDB has some exciting features that set it apart from other databases in the same category:

  • It’s built on PostgreSQL, arguably the best open source relational database out there. If your project is already running PostgreSQL, Timescale will slide right in.
  • Querying is done through the familiar SQL syntax, reducing the learning curve.
  • Ridiculously fast write speeds — millions of inserts per second aren’t unheard of.
  • Billions of rows or petabytes of data — it’s no big deal for Timescale.
  • True flexibility with schema — choose from relational or schemaless as per your needs.

It doesn’t make much sense to talk about when to use or not use TimescaleDB. If IoT is your domain, or you’re after similar database characteristics, Timescale is worth a look.
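
Because TimescaleDB is a PostgreSQL extension, the workflow is plain SQL plus one call to turn a table into a hypertable. A minimal sketch via psycopg2, assuming a PostgreSQL instance with the timescaledb extension available; the "conditions" table and its columns are hypothetical.

```python
# Minimal sketch of TimescaleDB's hypertable feature via psycopg2.
# Assumes PostgreSQL with the timescaledb extension installed;
# the "conditions" table is hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=demo user=demo password=demo host=localhost")
conn.autocommit = True
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb")
cur.execute("""
    CREATE TABLE IF NOT EXISTS conditions (
        time        TIMESTAMPTZ NOT NULL,
        sensor_id   TEXT        NOT NULL,
        temperature DOUBLE PRECISION
    )
""")
# Turn the plain table into a hypertable partitioned on the time column.
cur.execute("SELECT create_hypertable('conditions', 'time', if_not_exists => TRUE)")

cur.execute(
    "INSERT INTO conditions (time, sensor_id, temperature) VALUES (now(), %s, %s)",
    ("greenhouse-1", 21.7),
)
cur.close()
conn.close()
```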

CouchDB

CouchDB is a neat little database solution that sits quietly in a corner and has a small but dedicated following. It was created to deal with the problems of network loss and the eventual reconciliation of data, a problem so messy that developers would rather switch jobs than deal with it.

Essentially, you can think of a CouchDB cluster as a distributed collection of nodes large and small, some of which are expected to be offline. As soon as a node comes online, it sends data back to the cluster, which is slowly and carefully digested, eventually becoming available to the entire cluster.
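
CouchDB exposes everything over a plain HTTP/JSON API, which is part of why syncing across unreliable nodes works so well. A minimal sketch using the requests library, assuming a local CouchDB instance with admin credentials admin/secret; the database name and document contents are hypothetical.

```python
# Minimal sketch of CouchDB's HTTP/JSON interface using the requests library.
# Assumes a local CouchDB instance with admin credentials "admin"/"secret";
# the database and document contents are hypothetical.
import requests

base = "http://admin:secret@localhost:5984"

# Create a database (a simple PUT on the database name).
requests.put(f"{base}/field_notes")

# Store a document; CouchDB assigns an _id and a _rev used for conflict resolution.
resp = requests.post(f"{base}/field_notes", json={"device": "phone-7", "reading": 42})
doc_id = resp.json()["id"]

# Read it back.
print(requests.get(f"{base}/field_notes/{doc_id}").json())
```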

Unique features

CouchDB is something of a unique breed when it comes to databases.

  • Offline-first data syncing capabilities
  • Specialized versions for mobile and web browsers (PouchDB, CouchDB Lite, etc.)
  • Crash-resistant, battle-tested reliability
  • Easy clustering with redundant data storage

When to use CouchDB

CouchDB was built for offline tolerance and remains unmatched in this regard. A typical use case is mobile apps where a portion of your data resides on a CouchDB instance on the user’s phone (because that is where it was generated). The exciting thing is that you cannot rely on the user’s device to be connected all the time, which means the database has to be opportunistic and be ready to resolve conflicting updates later on. This is achieved using the impressive Couch Replication Protocol.

When not to use CouchDB

Trying to use CouchDB outside of its intended use case will lead to disaster. It uses far more storage than anything else out there, simply because it needs to maintain redundant copies of data and conflict resolution results. As a result, write speeds are also painfully slow. Finally, CouchDB is not suitable as a general-purpose database engine, as it doesn’t handle frequent schema changes well.

Conclusion

I had to leave out many interesting candidates like Riak, so this list is to be taken as a guide rather than a commandment. I hope I was able to achieve my goal with this article — present not just a collection of database recommendations, but also briefly discuss where and how they need to be used (and avoided!).

If you are curious to learn database then check out Udemy for some of the brilliant online courses.
