
You can always run your code through Cachegrind or perf and see what's happening.

I managed to reach the practical IPC limits of the hardware I was running on, and while I could theoretically have made the prefetcher happier with some matrix reordering, looking back, I'm not sure how much performance it would have bought, since the FPU was already saturated at that point.


Actually, C, FORTRAN and C++ are friendly to memory bandwidth when written correctly.

C++ is better than FORTRAN because, while FORTRAN is still being developed and is quite fast, doing anything beyond what core FORTRAN is good at is hard. At the end of the day, it computes and works well with MPI. That's mostly all.

C++ is better than C because it can accommodate C code, has many more convenience functions and libraries around it, and modern C++ can be written more concisely than C, with minimal or no added overhead.

Also, all three languages are so well studied that advanced programmers can look at a piece of code and say, "I can fit that into the cache, that'll work, that's fine."

"More modern" programming languages really solve no urgent problems in HPC space and current code works quite well there.

Reported from another HPC datacenter somewhere in the universe.


I suppose that most HPC problems are embarrassingly parallel™, and have very little if any mutable shared state?

I'd say that the opposite is more often the reality, which is why HPC systems tend to have high-bandwidth, low-latency networks.

High bandwidth may mean the need to consult some very large but immutable data structure. As a trivial example, multiplying two matrices requires accessing each matrix fully multiple times over, but neither of them is altered in the process, so it can safely be done in parallel. Recording the result of a (naive) matrix multiplication can also be done without programmatic coordination, because each element is only updated once, independently from others.
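The matrix example above can be sketched in a few lines of plain Python (a minimal illustration, not an HPC-grade kernel): the inputs are only read, and each worker writes a disjoint set of output rows, so no locks or coordination are needed.

```python
# Naive parallel matrix multiplication: A and B are read-only, and each
# worker writes a disjoint set of rows of C, so each element of C is
# written exactly once, by exactly one worker.
from concurrent.futures import ThreadPoolExecutor

def matmul_rows(A, B, C, rows):
    n = len(B[0])  # columns of B
    k = len(B)     # inner dimension
    for i in rows:
        for j in range(n):
            C[i][j] = sum(A[i][p] * B[p][j] for p in range(k))

def parallel_matmul(A, B, workers=4):
    m, n = len(A), len(B[0])
    C = [[0] * n for _ in range(m)]
    # Disjoint row sets: worker w handles rows w, w+workers, w+2*workers, ...
    chunks = [range(w, m, workers) for w in range(workers)]
    with ThreadPoolExecutor(workers) as ex:
        for chunk in chunks:
            ex.submit(matmul_rows, A, B, C, chunk)
    return C  # the `with` block waits for all workers to finish
```

Real HPC codes do the same partitioning across MPI ranks rather than threads, but the safety argument is identical.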

This is very unlike, say, a database engine, where mutations occur all the time and may come from multiple threads.

Rust specifically makes it hard or impossible to clobber shared mutable state, e.g. by producing a dangling pointer. But this is not a problem our matrix-multiplication example would have, so it won't benefit from being implemented in Rust. Maybe the same applies to many other classes of HPC problems.


HPC infrastructure is not like what you're used to. It has very high bandwidth, but latency depends on where your data lives. There are a lot more layers that complicate things, and each layer has a very different I/O speed.

https://extremecomputingtraining.anl.gov/sites/atpesc/files/...

Also, how the data is handled can be very different. Just see how libraries like this work: they take advantage of those burst buffers and try to minimize what's pulled from storage. There's a lot of memory management in the code people write to do all this complex stuff, so that you aren't waiting around for disks... or worse... tape.

https://adios-io.org/applications/


On the contrary. However, they tend to manually manage memory rather than outsourcing it to a language runtime or a distributed key-value store.

Long live Linux!

Why can’t you (as in Cal.com) spend that amount of money and find the vulnerabilities yourself?

You can keep the untested branch closed if you want to go with “cathedral” model, even.


The sad thing is, I started out trusting Android more than Apple's ecosystem, and after my first Android phone, I quickly jumped ship to iPhone.

My parents use Android devices and I manage them. With every iteration, Apple moved toward PalmOS' refined flows as much as possible, while Android became what Windows CE aspired to be: a complex multi-layer wafer where you can't tell which layer comes from where, and it's all different and non-standard between vendors.

Not least, Android is a mobile land of mini tools you have to install to get a power-user-friendly platform. It reminds me of my old Windows days, when I had to spend half a day installing utilities just to make the installation usable the way I wanted.


I have recently upgraded to an iPhone 17 Pro Max from an iPhone X (yes, I buy once a decade or so), and I also take photos with an A7 III.

The latest iPhone takes shockingly good photos. It's not a full frame mirrorless by any stretch, but it's really in another league when it comes to mobile photography.


Even shutting down HAL9000 was easier than this, and I'm half joking.

I named my phone HAL9000 and when I read this I immediately thought, "Well yeah I just turn it off"

> Arguably they should even when not in that mode, but it'll churn files repeatedly as you stream files in and out of local storage with the cloud provider.

When you have a couple terabytes of data in that drive, is it acceptable to cycle all that data and use all that bandwidth and wear down your SSD at the same time?

Also, a high number of small files is a problem for these services. I have a large font collection in my cloud account, and oh boy, if I want to sync that thing, the whole service proverbially overheats from all the queries it's sending.


Reading your comments, it sounds like you are arguing it is impossible to backup files in Dropbox in any reasonable way, and therefore nobody should backup their cloud files. I know you haven’t technically said that, but that’s what it sounds like.

I assume you don’t think that, so I’m curious, what would you propose positively?


> I know you haven’t technically said that, but that’s what it sounds like.

Yes, I didn't technically say that.

> It sounds like you are arguing it is impossible to backup files in Dropbox in any reasonable way, and therefore nobody should backup their cloud files.

I'm not arguing either of those.

What I said is that with "on-demand file download", traditional backup software faces a hard problem. However, there are better ways to do it, the primary candidate being rclone.

You can register a new application ID for your rclone installation for your Google Drive and Dropbox accounts, and use rclone as a very efficient, rsync-like tool to back up your cloud storage. That's what I do.

I'm currently backing up my cloud storage to a local TrueNAS installation. rclone automatically hash-checks everything and downloads only the changed files. If you can mount Backblaze via FUSE or something similar, you can use rclone as an intelligent man-in-the-middle agent to smartly pull from the cloud and push to Backblaze.
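A minimal sketch of that workflow, assuming a remote named `gdrive` has already been created with `rclone config` (the remote name and local path here are placeholders, not anything prescribed by rclone):

```shell
# One-way sync of the cloud account into a local directory;
# --checksum makes rclone compare hashes where the provider exposes them.
rclone sync gdrive: /mnt/tank/backups/gdrive --checksum --progress

# Spot-check afterwards without transferring anything.
rclone check gdrive: /mnt/tank/backups/gdrive
```

`rclone sync` makes the destination match the source, so point it at a dedicated backup directory, not somewhere with unrelated files.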

Also, using restic or Borg as a backup container is a good idea, since they deduplicate and/or store only the differences between snapshots, saving tons of space in the process, plus encrypting everything for good measure.


This. You should not try to backup your local cache of cloud files as if those were your local files. Use a tool that talks to the cloud storage directly.

Use tools with straightforward, predictable semantics, like rclone, or Syncthing, or restic/Borg. (Deduplication rules, too.)


My understanding of Backblaze Computer Backup is it is not a general purpose, network accessible filesystem.[0] If you want to use another tool to backup specific files, you'd use their B2 object storage platform.[1] It has an S3 compatible API you can interact with, Computer Backup does not.

But generally speaking, I'd agree with your sentiment.

[0]: https://www.backblaze.com/computer-backup/docs/supported-bac...

[1]: https://www.backblaze.com/docs/cloud-storage-about-backblaze...


But if the files are only on the remote storage and not local, chances are they haven't been modified recently, so it shouldn't download them fully; it should just check the metadata cache for size / modification time and leave them alone if they didn't change.

So, in practice, you shouldn't have to download the whole remote drive when you do an incremental backup.
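The incremental decision described above can be sketched in a few lines (the metadata shape here — a path mapped to a size/mtime pair — is illustrative, not any particular client's cache format):

```python
# Compare a cached metadata snapshot against the current remote listing
# and report only the paths whose size or mtime changed (or are new).
# Only those files would need to be downloaded for the backup run.

def files_to_fetch(cached, remote):
    """cached/remote: {path: (size, mtime)}. Returns paths needing download."""
    return [path for path, meta in remote.items() if cached.get(path) != meta]
```

This is exactly the cheap check that breaks down when mtimes can't be trusted, which is the point raised in the next comment.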


You can't trust size and modification time all the time; though mtime is a better indicator, it's not foolproof. The only reliable way is checksumming.

Interestingly, rclone supports that on many providers, but for Backblaze to support it, Backblaze would need to integrate rclone, connect to the providers via that channel and request checks, which is messy, complicated, and computationally expensive. And that's assuming you won't hit API rate limits on the cloud provider.
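For a local file, the content-based check looks like this (a standard chunked-hashing sketch; cloud providers instead expose precomputed hashes, which is what rclone queries when it can, avoiding the download entirely):

```python
# Hash a file in fixed-size chunks so even very large files never need
# to fit in memory; two files match iff their digests match.
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```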


If you can’t trust modification time you are doing something so unusual that you probably need to be handling your backups privately anyway.

I don't think so.

Sometimes modification time of a file which is not downloaded on computer A, but modified by computer B is not reflected immediately to computer A.

Hence, backup software running on computer A will think the file has not been modified. This is a known problem in file synchronization. Also, some applications that modify files revert or preserve the file's mtime for various reasons. They are rare, but they're there.


Then do it in memory, assuming those services allow you to read the files like that. It sounds like they do based on your other comments.

The problem is, downloading files and disk management are not in your control; that part is managed transparently by the cloud client (Dropbox, Google Drive, et al.). The application accessing the file just waits, akin to waiting for a disk to spin up.

The filesystem is a black box for this software, since it doesn't know where a file resides. If you want control, you need to talk with every party, including the cloud provider, rclone-style.


> Unless it does something very weird it won't trigger all those files to download at the same time. That shouldn't be a worry.

The moment you call read() (or fopen() or your favorite function), the download will be triggered. It's a hook sitting between you and the file; you can't bypass it.

The only way around it is to remount the storage over rclone or something similar and use the "ls" and "lsd" commands to query filenames. Otherwise it'll download, and that's how it's expected to work.


Why would it use either of those on all the files at once? It should only be opening enough files to fill the upload buffer.

I think you might be confusing Backblaze reading files with how Dropbox/OneDrive/Nextcloud/etc. work. NC doesn't enable this by default (I don't think), but Windows calls it virtual file support. There is no avoiding filling the upload buffer, because Backblaze has zero control over how Dropbox downloads files.

When Backblaze requests that a file be opened and read, Windows will ask Dropbox or whatever to open the file for it, and to read it. How that is done is up to whatever handles the virtual files. To Backblaze, your Dropbox folder is a normal directory, with all that that entails, so Backblaze thinks it can just zip through the directory and read data from disk, even though that isn't really what's happening.

I had to exclude my Nextcloud directory from my Duplicati backups for precisely this reason: my Nextcloud is hosted on my server, and Duplicati was sending it so many requests it would cause my server to start returning error 500s.

And no, my server isn't behind Cloudflare, primarily because I don't have $200 to throw at them to allow me to proxy arbitrary TCP/UDP ports through their network, and I don't know how to tell CF "Hey, only proxy this traffic but let me handle everything else" (assuming that's even possible, given that the usual flow is to put your entire domain behind them).


No, I'm not confusing anything.

Dropbox and OneDrive can handle Backblaze zipping through and opening many files. The risk is pulling down too many gigabytes at once, but that shouldn't happen, because Backblaze should only open enough files for immediate upload. If it does happen, it's very easily fixed.

If it overloads nextcloud by hitting too many files too fast, that's a legitimate issue but it's not what OP was worried about.


The issue you’re missing is that the abstraction Dropbox/OneDrive/etc provide is not that of an NFS. When an application triggers the download of a file, it hydrates the file to the local file system and keeps it there. So if Backblaze triggers the download of a TB of files, it will consume a TB of local file system space (which may not even exist).

It won't keep it permanently. That would break under normal use.

Keeping recent files will work fine with a program that goes through them as fast as it can upload (which is not super fast).


It does keep them permanently. Dropbox is not a NAS and does not pretend to be one.

> When you open an online-only file from the Dropbox folder on your computer, it will automatically download and become available offline. This means you’ll need to have enough hard drive space for the file to download before you can open it. You can change it back to online-only by following the instructions below.

https://help.dropbox.com/sync/make-files-online-only

Same exact behavior for OneDrive, though it apparently does have a Windows integration to eventually migrate unused files back to online-only if enabled.

> When you open an online-only file, it downloads to your device and becomes a locally available file. You can open a locally available file anytime, even without Internet access. If you need more space, you can change the file back to online only. Just right-click the file and select "Free up space."

https://support.microsoft.com/en-us/office/save-disk-space-w...


Maybe it will, maybe it won't, but it'll cycle every file on the drive and stress everything from your cloud provider to Backblaze, including everything in between, software- and hardware-wise.

That sounds very acceptable to get those files backed up.

It shouldn't stress things to spend a couple weeks relaying a terabyte in small chunks. The most likely strain is on my upload bandwidth and yeah that's the cost of cloud backup, more ISPs need to improve upload.


I mean, cycling a couple of terabytes of data through a 512 GB drive is at least 4 full drive writes, which is too much for that kind of thing.

> more ISPs need to improve upload.

I was yelling the same things to the void for the longest time, then I had a brilliant idea of reading the technical specs of the technology coming to my home.

Lo and behold, the numbers I got were at the technical limits of the technology I had at home (PON, for the time being), and going higher would need a very large and expensive rewiring, with new hardware and technology.


4 writes out of what, 3000? For something you'll need to do once or twice ever? It's fine. You might not even eat your whole Drive Write Per Day quota for the upload duration, let alone the entire month.
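The arithmetic both sides are using is easy to check (the ~3000-cycle endurance figure comes from this comment and is an assumption about the drive; check the actual rated TBW):

```python
# Back-of-envelope: churning ~2 TB through a 512 GB drive is how many
# full-drive writes, and what fraction of an assumed ~3000 P/E cycles?

def full_drive_writes(data_tb, capacity_gb):
    return data_tb * 1024 / capacity_gb

writes = full_drive_writes(2, 512)   # 4.0 full-drive writes
fraction = writes / 3000             # tiny share of assumed endurance
```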

> the technical limits of the technology that I had at home (PON for the time being)

Isn't that usually symmetrical? Is yours not?


> 4 writes out of what, 3000?

Depends on your device capacity and how much is actually in use. Wear leveling also adds wear of its own while it moves things around.

> For something you'll need to do once or twice ever?

I don't know about you, but my cloud storage is living, and even if it weren't, if the software can't smartly ignore files, it'll pull everything in, compare, and pass without uploading, causing churn in every backup cycle.

> Isn't that usually symmetrical? Is yours not?

GPON (Gigabit PON) is asymmetric. The theoretical limits are 2.4 Gbps down, 1.2 Gbps up. I have 1000 Mbit/75 Mbit at home.


> I don't know you, but my cloud storage is living

But you're probably changing less than 1% each day. And new changes are likely already in the cache, no need to download them.

> if the software can't smartly ignore files, it'll

Backblaze checks the modification date.

> GPON (Gigabit PON) is asymmetric. Theoretical limits is 2.4Gbps down, 1.2Gbps up. I have 1000Mbit/75Mbit at home.

2:1 is fine. If you're getting worse than 10:1 then that does sound like your ISP failed you?


How do you know how often those files need to be backed up without reading them? Timestamps and sizes are not reliable, only content hashes. How do you get a content hash? You read the file.

If timestamps aren’t reliable, you fall way outside the set of users who can trust a third-party backup provider. Name a case where the modification timestamp fails but a cloud provider would still catch the need to download the file.

Backblaze already trusts the modification date.

Why would it do that more than once unless you are modifying 4TB of data every day, in which case you are causing the problem.

I don't know how your client works, but reading metadata (e.g. requesting the size) off a file causes some cloud clients to download it completely.

Of course I'm not modifying 4TB on a cloud drive, every day.


Can you name such a client? That sounds like a terrible experience.

The font is dark enough, yet the weight is too light; hairline or ultra-thin or something. It's eye-straining.
